Introduction¶

Concrete is a widely used construction material known for its high compressive strength and durability. The compressive strength of concrete is crucial in the design of concrete structures, as it indicates the material's ability to withstand compression loads, such as those experienced by columns, foundations, and pavements. This strength is influenced by various factors including the concrete's composition, ingredient proportions, mixing methods, and curing conditions (Nawy, 2008).

The project aims to analyze the Concrete Compressive Strength dataset from the UCI Machine Learning Repository to explore the relationships between concrete mix design and compressive strength. This dataset contains information on the ingredients used in concrete mixtures, including cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, fine aggregate, and the concrete's age, with the target variable being the measured compressive strength in MPa. Understanding the factors affecting concrete strength is essential for optimizing mix designs to meet specific strength requirements, and statistical and machine learning techniques can be applied to uncover patterns and build predictive models (Siddique et al., 2011; Khademi et al., 2016).

Research in the field has demonstrated the application of advanced machine learning approaches to predicting the compressive strength of concrete containing supplementary cementitious materials. Studies have also used machine learning models to predict the strength properties of concrete, highlighting the potential of these techniques for optimizing concrete performance (Ahmad et al., 2021). Machine learning models have likewise been applied to predicting the slump, Vebe time, and compaction factor of concrete, underscoring their versatility in assessing concrete properties (Al-Hashem et al., 2022).

In conclusion, the investigation into the factors influencing concrete compressive strength is crucial for optimizing mix designs and improving the quality and efficiency of concrete construction. The application of statistical and machine learning techniques to analyze the Concrete Compressive Strength dataset from the UCI Machine Learning Repository presents an opportunity to gain valuable insights into concrete mix design and strength prediction.

Background Information¶

The study of concrete's resistance to compressive loads has long been central to building and civil engineering. This strength is crucial for ensuring buildings and other structures can stand up to the test of time and the elements. The dataset used in this research reflects the long-standing effort to understand what makes concrete strong by examining the different materials that go into it.

The proportions of water, cement, and supplementary materials such as fly ash and slag, which affect how concrete cures and how durable it becomes, are not just numbers in a table. They tell a story about concrete and its role in safe and lasting construction. These ingredients are essential for fine-tuning the makeup of concrete to meet the demands of different building projects.

Despite a lot of study, predicting the strength of concrete is still a complex issue because of the unpredictable ways its components interact. This research addresses this challenge by using machine learning, which is well-suited for understanding such complex patterns. Machine learning algorithms can find hidden trends in data that might be missed by traditional statistical methods.

Using machine learning to look into concrete's properties is not only for academic research. It has real-world uses, helping engineers make better concrete mixtures. This research adds to the foundational knowledge of how to build strong structures and plays a part in shaping the future of construction. It merges lessons from the past with the latest in data analysis, showing how a deep dive into data can lead to smarter building methods and stronger materials.

Research Questions¶

The analysis of the Concrete Compressive Strength dataset aims to address the following key research inquiries:

  1. How do individual components, including cement, blast furnace slag, fly ash, water, superplasticizer, coarse aggregate, and fine aggregate, impact the compressive strength of concrete?
  2. What is the relative significance of each component in influencing the ultimate compressive strength?
  3. Can we accurately classify concrete compressive strength as high, medium, or low based on the composition and types of mixture components?
  4. What are the critical features or component combinations that distinguish the categories of high, medium, and low compressive strength?
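Question 3 presupposes a categorical strength target. As a minimal sketch of how the continuous strength variable could be binned into low/medium/high classes, the cut points below (25 MPa and 45 MPa) are illustrative assumptions, not values taken from the dataset:

```python
import pandas as pd

# Hypothetical strength readings in MPa (values for illustration only)
strength = pd.Series([12.5, 34.4, 61.9, 23.7, 79.99])

# Assumed cut points: <=25 MPa low, 25-45 MPa medium, >45 MPa high
labels = pd.cut(strength, bins=[0, 25, 45, 100], labels=['low', 'medium', 'high'])
print(labels.tolist())  # → ['low', 'medium', 'high', 'low', 'high']
```

In practice the thresholds would be chosen from domain requirements or from the empirical distribution (for example, terciles of the observed strengths).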

Data Source and Collection¶

The data used for this analysis on Concrete Compressive Strength will be sourced from the UCI Machine Learning Repository's open data portal. This dataset was generously provided by Prof. I-Cheng Yeh from the Department of Information Management at Chung-Hua University in Hsin Chu, Taiwan.

This dataset can be accessed through the following link: Concrete Compressive Strength Dataset.

Data Description¶

The dataset used for the machine learning project contains various variables, which are described below. For each variable, the name, type, measurement unit, and a brief description are given. The order of this listing corresponds to the sequence of numerical values in the database columns.

Variable Description¶

Variable                       Type          Unit                 Description
Cement                         quantitative  kg in a m³ mixture   Input variable
Blast Furnace Slag             quantitative  kg in a m³ mixture   Input variable
Fly Ash                        quantitative  kg in a m³ mixture   Input variable
Water                          quantitative  kg in a m³ mixture   Input variable
Superplasticizer               quantitative  kg in a m³ mixture   Input variable
Coarse Aggregate               quantitative  kg in a m³ mixture   Input variable
Fine Aggregate                 quantitative  kg in a m³ mixture   Input variable
Age                            quantitative  days (1-365)         Input variable
Concrete Compressive Strength  quantitative  MPa                  Output variable

Data Preprocessing and Cleaning¶

Package Importation¶

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
import openpyxl  # engine used later for writing .xlsx files
import xlrd      # engine for reading legacy .xls files

Upload data¶

Load data into Pandas dataframe¶

In [2]:
df = pd.read_csv('Concrete_Data.csv')

Initial Data Preview¶

In [3]:
df.head()
Out[3]:
Cement (component 1)(kg in a m^3 mixture) Blast Furnace Slag (component 2)(kg in a m^3 mixture) Fly Ash (component 3)(kg in a m^3 mixture) Water (component 4)(kg in a m^3 mixture) Superplasticizer (component 5)(kg in a m^3 mixture) Coarse Aggregate (component 6)(kg in a m^3 mixture) Fine Aggregate (component 7)(kg in a m^3 mixture) Age (day) Concrete compressive strength(MPa, megapascals)
0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28 79.99
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28 61.89
2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270 40.27
3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365 41.05
4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360 44.30

Dataset Information Summary¶

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column                                                 Non-Null Count  Dtype  
---  ------                                                 --------------  -----  
 0   Cement (component 1)(kg in a m^3 mixture)              1030 non-null   float64
 1   Blast Furnace Slag (component 2)(kg in a m^3 mixture)  1030 non-null   float64
 2   Fly Ash (component 3)(kg in a m^3 mixture)             1030 non-null   float64
 3   Water  (component 4)(kg in a m^3 mixture)              1030 non-null   float64
 4   Superplasticizer (component 5)(kg in a m^3 mixture)    1030 non-null   float64
 5   Coarse Aggregate  (component 6)(kg in a m^3 mixture)   1030 non-null   float64
 6   Fine Aggregate (component 7)(kg in a m^3 mixture)      1030 non-null   float64
 7   Age (day)                                              1030 non-null   int64  
 8   Concrete compressive strength(MPa, megapascals)        1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB

Statistical Summary of the Dataset¶

In [5]:
df.describe()
Out[5]:
Cement (component 1)(kg in a m^3 mixture) Blast Furnace Slag (component 2)(kg in a m^3 mixture) Fly Ash (component 3)(kg in a m^3 mixture) Water (component 4)(kg in a m^3 mixture) Superplasticizer (component 5)(kg in a m^3 mixture) Coarse Aggregate (component 6)(kg in a m^3 mixture) Fine Aggregate (component 7)(kg in a m^3 mixture) Age (day) Concrete compressive strength(MPa, megapascals)
count 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000 1030.000000
mean 281.167864 73.895825 54.188350 181.567282 6.204660 972.918932 773.580485 45.662136 35.817961
std 104.506364 86.279342 63.997004 21.354219 5.973841 77.753954 80.175980 63.169912 16.705742
min 102.000000 0.000000 0.000000 121.800000 0.000000 801.000000 594.000000 1.000000 2.330000
25% 192.375000 0.000000 0.000000 164.900000 0.000000 932.000000 730.950000 7.000000 23.710000
50% 272.900000 22.000000 0.000000 185.000000 6.400000 968.000000 779.500000 28.000000 34.445000
75% 350.000000 142.950000 118.300000 192.000000 10.200000 1029.400000 824.000000 56.000000 46.135000
max 540.000000 359.400000 200.100000 247.000000 32.200000 1145.000000 992.600000 365.000000 82.600000

Correlation Matrix Analysis¶

In [6]:
df.corr()
Out[6]:
Cement (component 1)(kg in a m^3 mixture) Blast Furnace Slag (component 2)(kg in a m^3 mixture) Fly Ash (component 3)(kg in a m^3 mixture) Water (component 4)(kg in a m^3 mixture) Superplasticizer (component 5)(kg in a m^3 mixture) Coarse Aggregate (component 6)(kg in a m^3 mixture) Fine Aggregate (component 7)(kg in a m^3 mixture) Age (day) Concrete compressive strength(MPa, megapascals)
Cement (component 1)(kg in a m^3 mixture) 1.000000 -0.275216 -0.397467 -0.081587 0.092386 -0.109349 -0.222718 0.081946 0.497832
Blast Furnace Slag (component 2)(kg in a m^3 mixture) -0.275216 1.000000 -0.323580 0.107252 0.043270 -0.283999 -0.281603 -0.044246 0.134829
Fly Ash (component 3)(kg in a m^3 mixture) -0.397467 -0.323580 1.000000 -0.256984 0.377503 -0.009961 0.079108 -0.154371 -0.105755
Water (component 4)(kg in a m^3 mixture) -0.081587 0.107252 -0.256984 1.000000 -0.657533 -0.182294 -0.450661 0.277618 -0.289633
Superplasticizer (component 5)(kg in a m^3 mixture) 0.092386 0.043270 0.377503 -0.657533 1.000000 -0.265999 0.222691 -0.192700 0.366079
Coarse Aggregate (component 6)(kg in a m^3 mixture) -0.109349 -0.283999 -0.009961 -0.182294 -0.265999 1.000000 -0.178481 -0.003016 -0.164935
Fine Aggregate (component 7)(kg in a m^3 mixture) -0.222718 -0.281603 0.079108 -0.450661 0.222691 -0.178481 1.000000 -0.156095 -0.167241
Age (day) 0.081946 -0.044246 -0.154371 0.277618 -0.192700 -0.003016 -0.156095 1.000000 0.328873
Concrete compressive strength(MPa, megapascals) 0.497832 0.134829 -0.105755 -0.289633 0.366079 -0.164935 -0.167241 0.328873 1.000000

Here's a brief explanation of the results:

  • Cement has the strongest positive correlation with compressive strength (r ≈ 0.50): as the amount of cement increases, strength tends to increase.
  • Blast Furnace Slag and Fly Ash correlate negatively with cement, suggesting that mixes richer in these supplementary materials typically contain less cement.
  • Water has a notably strong negative correlation with superplasticizer (r ≈ -0.66), which makes sense because superplasticizers are used to reduce water content while maintaining workability.
  • Superplasticizer shows a moderate positive correlation with strength (r ≈ 0.37), indicating that its use tends to benefit strength.
  • Coarse Aggregate and Fine Aggregate show only weak correlations with strength, suggesting they may not be strong individual predictors in this dataset.
  • Age correlates positively with strength (r ≈ 0.33), which is expected as concrete generally continues to cure and gain strength over time.
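The kind of ranking described above can be reproduced by sorting features on their absolute correlation with the target. A small self-contained sketch on synthetic data (column names and coefficients are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'Cement': rng.uniform(100, 540, n),
    'Water': rng.uniform(120, 250, n),
    'Age': rng.uniform(1, 365, n),
})
# Toy target: strength rises with cement and age, falls with water (assumed coefficients)
toy['Strength'] = (0.08 * toy['Cement'] - 0.10 * toy['Water']
                   + 0.04 * toy['Age'] + rng.normal(0, 3, n))

# Rank features by absolute correlation with the target
ranked = toy.corr()['Strength'].drop('Strength').abs().sort_values(ascending=False)
print(ranked)
```

On the real dataframe the same one-liner applied to the strength column of `df.corr()` would order the mixture components by the magnitude of their linear association with strength.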

Data Cleaning and Quality Assurance¶

After identifying the relevant parameters, our data cleaning process commences. Initially, we eliminate observations with missing values, followed by the removal of duplicate entries. However, it's essential to exercise caution before discarding these data points. We should ascertain whether missing values were genuinely absent or not provided for specific reasons. Similarly, for duplicates, we must investigate why certain observations were duplicated. Removing them should be contingent upon understanding these nuances and circumstances.

Creating a Backup Copy of the Dataframe for Cross-Checking¶

In [7]:
df2=df.copy()

Counting Missing Values in the Dataframe¶

In [8]:
df.isnull().sum()
Out[8]:
Cement (component 1)(kg in a m^3 mixture)                0
Blast Furnace Slag (component 2)(kg in a m^3 mixture)    0
Fly Ash (component 3)(kg in a m^3 mixture)               0
Water  (component 4)(kg in a m^3 mixture)                0
Superplasticizer (component 5)(kg in a m^3 mixture)      0
Coarse Aggregate  (component 6)(kg in a m^3 mixture)     0
Fine Aggregate (component 7)(kg in a m^3 mixture)        0
Age (day)                                                0
Concrete compressive strength(MPa, megapascals)          0
dtype: int64

Counting Duplicate Rows in the Dataframe¶

In [9]:
df.duplicated().sum()
Out[9]:
25

Removing Missing Values and Duplicate Rows from the Dataframe¶

In [10]:
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

Shortening Column Names for Clarity¶

In [11]:
# Keep only the text before the first '(' in each column name
name = [col.split('(')[0] for col in df.columns]

Creating a Copy of the Dataframe with Modified Column Names¶

In [12]:
df2=df.copy()
df2.columns=name

Boxplot Visualization of Melted Data¶

In [13]:
df_melted = df2.melt()
plt.figure(figsize=(15,6))
ax = plt.axes()

# Assign `hue` alongside a qualitative palette so each boxplot gets its own color
# (passing `palette` without `hue` is deprecated in recent seaborn versions)
box_plot = sns.boxplot(x='variable', y='value', hue='variable', data=df_melted, palette='Set3', legend=False)

# Uncomment to add a legend if needed
# box_plot.legend(df2.columns, bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

plt.show()
[Figure: boxplot of each variable in the melted dataframe]

Here's a brief analysis:

  • Cement has a wide range of values with a high median, suggesting that the amount used varies significantly across mixes.
  • Blast Furnace Slag and Fly Ash show lower medians and tighter distributions, indicating less variability and lower usage quantities in the mixtures.
  • Water has a tight distribution around a central median, suggesting consistent usage across mixes.
  • Superplasticizer appears to have a tight distribution with some outliers, indicating that while it's generally used in consistent amounts, there are cases where it's used much more or much less.
  • Coarse Aggregate and Fine Aggregate have wide ranges but fairly central medians, indicating variability in usage.
  • Age shows a skewed distribution with outliers, suggesting that while most of the concrete samples are of a certain age range, there are some significantly older samples.
  • Concrete compressive strength shows a wide range of values, which is critical for understanding how the other components affect the strength of the concrete.

The boxplot reveals the presence of outliers within the dataset.

Outlier Removal Using the Interquartile Range (IQR) Method¶

In [14]:
def remove_outliers(df):
  q1 = df.quantile(0.25)
  q3 = df.quantile(0.75)
  iqr = q3 - q1
  # Drop any row containing a value outside [q1 - 1.5*IQR, q3 + 1.5*IQR]
  mask = ((df < (q1 - 1.5 * iqr)) | (df > (q3 + 1.5 * iqr))).any(axis=1)
  return df[~mask]
df2 = remove_outliers(df2)
In [15]:
# Running the filter a second time removes further rows, because the
# quartiles are recomputed on the already-filtered data
df2 = remove_outliers(df2)

Comparing Dataframe Shapes Before and After Outlier Removal¶

In [16]:
df.shape
Out[16]:
(1005, 9)
In [17]:
df2.shape
Out[17]:
(776, 9)
In [18]:
df
Out[18]:
Cement (component 1)(kg in a m^3 mixture) Blast Furnace Slag (component 2)(kg in a m^3 mixture) Fly Ash (component 3)(kg in a m^3 mixture) Water (component 4)(kg in a m^3 mixture) Superplasticizer (component 5)(kg in a m^3 mixture) Coarse Aggregate (component 6)(kg in a m^3 mixture) Fine Aggregate (component 7)(kg in a m^3 mixture) Age (day) Concrete compressive strength(MPa, megapascals)
0 540.0 0.0 0.0 162.0 2.5 1040.0 676.0 28 79.99
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28 61.89
2 332.5 142.5 0.0 228.0 0.0 932.0 594.0 270 40.27
3 332.5 142.5 0.0 228.0 0.0 932.0 594.0 365 41.05
4 198.6 132.4 0.0 192.0 0.0 978.4 825.5 360 44.30
... ... ... ... ... ... ... ... ... ...
1025 276.4 116.0 90.3 179.6 8.9 870.1 768.3 28 44.28
1026 322.2 0.0 115.6 196.0 10.4 817.9 813.4 28 31.18
1027 148.5 139.4 108.6 192.7 6.1 892.4 780.0 28 23.70
1028 159.1 186.7 0.0 175.6 11.3 989.6 788.9 28 32.77
1029 260.9 100.5 78.3 200.6 8.6 864.5 761.5 28 32.40

1005 rows × 9 columns

Renaming DataFrame Columns Based on a Mapping¶

In [19]:
print(df2.columns)
Index(['Cement ', 'Blast Furnace Slag ', 'Fly Ash ', 'Water  ',
       'Superplasticizer ', 'Coarse Aggregate  ', 'Fine Aggregate ', 'Age ',
       'Concrete compressive strength'],
      dtype='object')
In [20]:
# Step 1: Strip leading and trailing spaces from column names in df2
df2.columns = df2.columns.str.strip()

# Step 2: Update the column mapping dictionary to match the stripped column names accurately
column_mapping = {
    'Concrete compressive strength': 'Strength',  # Removed the extra spaces and any other format issues
    'Cement': 'Cement_content'  # Assuming 'Cement (component 1)(kg in a m^3 mixture)' also had spaces trimmed
}

# Step 3: Apply the renaming to df2 using the updated column mapping
df2 = df2.rename(columns=column_mapping)
df2
Out[20]:
Cement_content Blast Furnace Slag Fly Ash Water Superplasticizer Coarse Aggregate Fine Aggregate Age Strength
1 540.0 0.0 0.0 162.0 2.5 1055.0 676.0 28 61.89
8 266.0 114.0 0.0 228.0 0.0 932.0 670.0 28 45.85
11 198.6 132.4 0.0 192.0 0.0 978.4 825.5 28 28.02
14 304.0 76.0 0.0 228.0 0.0 932.0 670.0 28 47.81
21 139.6 209.4 0.0 192.0 0.0 1047.0 806.9 28 28.24
... ... ... ... ... ... ... ... ... ...
1025 276.4 116.0 90.3 179.6 8.9 870.1 768.3 28 44.28
1026 322.2 0.0 115.6 196.0 10.4 817.9 813.4 28 31.18
1027 148.5 139.4 108.6 192.7 6.1 892.4 780.0 28 23.70
1028 159.1 186.7 0.0 175.6 11.3 989.6 788.9 28 32.77
1029 260.9 100.5 78.3 200.6 8.6 864.5 761.5 28 32.40

776 rows × 9 columns

In [21]:
df2.to_excel('cleaned_data.xlsx', index=False)
In [22]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from scipy.stats import gaussian_kde, linregress

# Load the data
cleaned_data = pd.read_excel('cleaned_data.xlsx', engine='openpyxl')

# Create a figure with a grid of subplots
fig, axs = plt.subplots(2, 3, figsize=(15, 10), dpi=1200)  # High-resolution for printing

# Define a function to plot histograms with a scaled density curve
def plot_hist_with_density(ax, data, bins, color, title, xlabel, ylabel):
    # Plot histogram with raw counts
    count, bins, ignored = ax.hist(data, bins=bins, density=False, alpha=0.7, color=color, edgecolor='black')
    
    # Calculate bin width
    bin_width = bins[1] - bins[0]
    
    # Perform KDE
    kde = gaussian_kde(data)
    x = np.linspace(bins[0], bins[-1], 300)
    
    # Scale the KDE to match the histogram counts
    scaled_kde = kde(x) * len(data) * bin_width
    ax.plot(x, scaled_kde, color='red', linewidth=2)  # Scaled density curve in red
    
    # Set plot titles and labels with bold fonts
    ax.set_title(title, fontweight='bold', fontsize=16)
    ax.set_xlabel(xlabel, fontweight='bold', fontsize=14)
    ax.set_ylabel(ylabel, fontweight='bold', fontsize=14)
    
    # Set tick label font weight to bold and adjust tick parameters
    ax.tick_params(axis='both', which='both', labelsize=12, width=2)
    for label in ax.get_xticklabels():
        label.set_fontweight('bold')
    for label in ax.get_yticklabels():
        label.set_fontweight('bold')
    
    ax.grid(False)  # Remove grid

# Define a function to plot scatter plots with a regression line
def plot_scatter_with_regression(ax, x, y, color, title, xlabel, ylabel):
    ax.scatter(x, y, color=color, alpha=0.5)
    slope, intercept, r_value, p_value, std_err = linregress(x, y)
    # Plot regression line
    ax.plot(x, intercept + slope * x, color='black', label=f'y={slope:.2f}x+{intercept:.2f}')
    ax.set_title(title, fontweight='bold', fontsize=16)
    ax.set_xlabel(xlabel, fontweight='bold', fontsize=14)
    ax.set_ylabel(ylabel, fontweight='bold', fontsize=14)
    
    # Set tick label font weight to bold and adjust tick parameters
    ax.tick_params(axis='both', which='both', labelsize=12, width=2)
    for label in ax.get_xticklabels():
        label.set_fontweight('bold')
    for label in ax.get_yticklabels():
        label.set_fontweight('bold')
    
    ax.legend(fontsize=12)
    ax.grid(False)  # Remove grid

# Configure plots with updated histogram function
plot_hist_with_density(
    axs[0, 0],
    cleaned_data['Cement_content'],
    bins=20,
    color='blue',
    title='Cement Content Frequency',
    xlabel='Cement Content',
    ylabel='Count'
)

plot_hist_with_density(
    axs[0, 1],
    cleaned_data['Water'],
    bins=20,
    color='green',
    title='Water Content Frequency',
    xlabel='Water Content',
    ylabel='Count'
)

plot_scatter_with_regression(
    axs[0, 2],
    cleaned_data['Water'],
    cleaned_data['Superplasticizer'],
    color='blue',
    title='Superplasticizer vs. Water Content',
    xlabel='Water Content',
    ylabel='Superplasticizer'
)

plot_scatter_with_regression(
    axs[1, 0],
    cleaned_data['Cement_content'],
    cleaned_data['Strength'],
    color='darkblue',
    title='Strength vs. Cement Content',
    xlabel='Cement Content',
    ylabel='Strength'
)

plot_hist_with_density(
    axs[1, 1],
    cleaned_data['Age'],
    bins=20,
    color='orange',
    title='Age Frequency',
    xlabel='Age (days)',
    ylabel='Count'
)

plot_scatter_with_regression(
    axs[1, 2],
    cleaned_data['Age'],
    cleaned_data['Strength'],
    color='darkblue',
    title='Strength vs. Age',
    xlabel='Age (days)',
    ylabel='Strength'
)

# Adjust layout to avoid overlap and ensure everything fits well
fig.tight_layout(pad=3.0)

# Show the plot
plt.show()
[Figure: grid of histograms with density curves and scatter plots with regression lines]

Linear Regression Analysis: Cement Content vs. Compressive Strength¶

In [23]:
import seaborn as sns
import matplotlib.pyplot as plt

# Use the correct DataFrame (df2) and ensure column names are used appropriately
sns.lmplot(x="Cement_content", y="Strength", data=df2, ci=None)

# Set the correct titles and labels based on the data
plt.title('Linear Regression Plot: Compressive Strength vs. Cement Content')
plt.xlabel('Cement Content (kg in a m^3 mixture)')  # Cement content on the x-axis
plt.ylabel('Compressive Strength (MPa)')  # Strength on the y-axis
plt.show()
[Figure: regression plot of compressive strength vs. cement content]

There is a positive correlation between cement content and compressive strength. As the amount of cement increases, the strength of the concrete tends to increase, which is indicated by the upward slope of the regression line.

Linear Regression Plot: Slag Content vs. Compressive Strength¶

In [24]:
# Use the correct DataFrame (df2) and the precise column names
sns.lmplot(x="Blast Furnace Slag", y="Strength", data=df2, ci=None)

# Set the correct titles and labels based on the data
plt.title('Linear Regression Analysis: Slag Content vs. Compressive Strength')
plt.xlabel('Blast Furnace Slag (kg in a m^3 mixture)')
plt.ylabel('Concrete Compressive Strength (MPa)')
plt.show()
[Figure: regression plot of compressive strength vs. blast furnace slag content]

Blast Furnace Slag content shows a weak positive correlation with concrete compressive strength, as indicated by the slight upward trend in the regression line. The spread of data points shows considerable variability, reflecting the natural variance in the dataset.

Correlation Table for Data Analysis¶

In [25]:
df2.corr()
Out[25]:
Cement_content Blast Furnace Slag Fly Ash Water Superplasticizer Coarse Aggregate Fine Aggregate Age Strength
Cement_content 1.000000 -0.319044 -0.333643 -0.112614 -0.017396 -0.072952 -0.220883 -0.058271 0.493456
Blast Furnace Slag -0.319044 1.000000 -0.323462 0.154025 0.016125 -0.285035 -0.300915 -0.026204 0.098696
Fly Ash -0.333643 -0.323462 1.000000 -0.240185 0.518591 -0.114954 0.018531 0.223694 0.008733
Water -0.112614 0.154025 -0.240185 1.000000 -0.578940 -0.220846 -0.303960 -0.094850 -0.381370
Superplasticizer -0.017396 0.016125 0.518591 -0.578940 1.000000 -0.268183 0.066432 0.245464 0.399168
Coarse Aggregate -0.072952 -0.285035 -0.114954 -0.220846 -0.268183 1.000000 -0.167895 -0.085354 -0.206545
Fine Aggregate -0.220883 -0.300915 0.018531 -0.303960 0.066432 -0.167895 1.000000 -0.010270 -0.192622
Age -0.058271 -0.026204 0.223694 -0.094850 0.245464 -0.085354 -0.010270 1.000000 0.566494
Strength 0.493456 0.098696 0.008733 -0.381370 0.399168 -0.206545 -0.192622 0.566494 1.000000
In [26]:
# Calculate the correlation matrix
corr_matrix = df2.corr()

# Set up the matplotlib figure
plt.figure(figsize=(10, 8))

# Create a heatmap
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5, fmt='.2f')

# Add title
plt.title('Correlation Heatmap of Concrete Dataset Features')

# Show the plot
plt.show()
[Figure: correlation heatmap of the cleaned dataset]

Pairplot for Exploring Correlations¶

In [27]:
sns.pairplot(df2,kind='reg')
Out[27]:
<seaborn.axisgrid.PairGrid at 0x13ebc5570>
[Figure: pairplot of all variables with regression fits]

Calculating Covariance Matrix for Data Analysis¶

In [28]:
covar=df2.cov()
covar
Out[28]:
Cement_content Blast Furnace Slag Fly Ash Water Superplasticizer Coarse Aggregate Fine Aggregate Age Strength
Cement_content 10383.696230 -2812.267191 -2202.101427 -197.703610 -9.185257 -588.808372 -1584.531249 -90.237771 786.521782
Blast Furnace Slag -2812.267191 7482.723032 -1812.313337 229.544107 7.227570 -1952.949724 -1832.469116 -34.447584 133.541797
Fly Ash -2202.101427 -1812.313337 4195.268026 -268.021809 174.048490 -589.747637 84.496795 220.186642 8.847839
Water -197.703610 229.544107 -268.021809 296.817412 -51.682520 -301.367289 -368.658329 -24.833582 -102.772589
Superplasticizer -9.185257 7.227570 174.048490 -51.682520 26.849190 -110.067661 24.232866 19.329056 32.352460
Coarse Aggregate -588.808372 -1952.949724 -589.747637 -301.367289 -110.067661 6273.723353 -936.187820 -102.740853 -255.896682
Fine Aggregate -1584.531249 -1832.469116 84.496795 -368.658329 24.232866 -936.187820 4955.950839 -10.987089 -212.107034
Age -90.237771 -34.447584 220.186642 -24.833582 19.329056 -102.740853 -10.987089 230.948507 134.660315
Strength 786.521782 133.541797 8.847839 -102.772589 32.352460 -255.896682 -212.107034 134.660315 244.665494

Generating Heatmap of Covariance Matrix¶

In [29]:
plt.figure(figsize=(10,10))
sns.heatmap(covar, annot=True, fmt='.2f')
plt.show()
[Figure: heatmap of the covariance matrix]
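Because covariance is scale-dependent, features measured in larger units dominate the matrix above, which is why the data are standardized before PCA in the next section. A quick self-contained check (on synthetic data) that the covariance of standardized data equals the correlation matrix:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
demo['b'] = demo['b'] + demo['a']   # introduce correlation between two columns
demo['c'] = demo['c'] * 1000        # wildly different scale

# Standardize: subtract the mean, divide by the standard deviation
standardized = (demo - demo.mean()) / demo.std()

# After standardization, covariance and correlation coincide
print(np.allclose(standardized.cov(), demo.corr()))  # → True
```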

Principal Component Analysis (PCA) Setup¶

In [30]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Scaling Data for PCA¶

In [31]:
# Assuming df2 is your cleaned DataFrame, separate the features (X) from the target (y)
X = df2.drop(columns=['Strength'])  # Only use the 8 independent features
y = df2['Strength']  # Target variable (y)
In [32]:
# Step 1: Standardize the features (X)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
In [33]:
# Assuming X is the DataFrame of features (independent variables) as scaled data
# Assign the column names of the independent features (not including the target variable)
df_sca = pd.DataFrame(X_scaled, columns=X.columns)

# Display the first few rows to check the scaled data
print(df_sca.head())
   Cement_content  Blast Furnace Slag   Fly Ash     Water  Superplasticizer  \
0        2.641676           -0.827207 -0.969416 -1.075654         -0.726344   
1       -0.048957            0.491521 -0.969416  2.757703         -1.209129   
2       -0.710814            0.704368 -0.969416  0.666781         -1.209129   
3        0.324196            0.051945 -0.969416  2.757703         -1.209129   
4       -1.290184            1.595087 -0.969416  0.666781         -1.209129   

   Coarse Aggregate  Fine Aggregate       Age  
0          1.021052       -1.443308  0.354341  
1         -0.532846       -1.528592  0.354341  
2          0.053340        0.681686  0.354341  
3         -0.532846       -1.528592  0.354341  
4          0.919985        0.417306  0.354341  

Generating Heatmap of Covariance Matrix for Scaled Data¶

In [34]:
plt.figure(figsize=(8,8))
sns.heatmap(df_sca.cov(), annot=True)
plt.show()
[Figure: heatmap of the covariance matrix of the scaled data]

Principal Component Analysis (PCA) Calculation¶

In [35]:
# Step 2: Initialize PCA and fit the model
pca = PCA(n_components=X.shape[1])  # Number of components = number of features
pca.fit(X_scaled)
Out[35]:
PCA(n_components=8)
In [36]:
# Calculating Explained Variance Ratios from PCA
var = pca.explained_variance_ratio_

# Display the explained variance ratios
print(var)
[0.26922561 0.18995303 0.1494953  0.13767338 0.12241798 0.1026697
 0.02453606 0.00402894]
In [37]:
# Assuming 'var' contains the explained variance ratios
cum = np.cumsum(var * 100)  # Cumulative sum of the explained variance in percentage

# Display the cumulative explained variance
print("Cumulative Explained Variance (%):\n", cum)
Cumulative Explained Variance (%):
 [ 26.92256061  45.91786385  60.86739358  74.63473172  86.87652936
  97.14349955  99.59710591 100.        ]

Generating Cumulative Variance Explained Plot¶

In [38]:
# Plotting the cumulative explained variance
plt.figure(figsize=(5, 5))
ax = plt.axes()

# Create an index for the principal components (1, 2, 3, ..., n)
components = np.arange(1, len(cum) + 1)

# Plot the cumulative explained variance
sns.lineplot(x=components, y=cum, marker='o')

# Set labels and title
ax.set_xlabel('Number of Principal Components')
ax.set_ylabel('Cumulative Variance (%)')
plt.title('Cumulative Explained Variance by Principal Components')
plt.show()
[Figure: cumulative explained variance vs. number of principal components]

From the plot, it is clear that the cumulative explained variance increases as more principal components are added. The curve starts to level off after the 6th component, with the variance explained reaching around 97%. Beyond the 6th component, the increase in explained variance becomes minimal, suggesting that most of the variability in the data is captured by the first 6 components. Therefore, the 'elbow' in this plot is likely around the 6th component, indicating that retaining around 6 components would effectively capture the majority of the variance in the data, while adding more components yields diminishing returns.
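The elbow criterion can also be applied programmatically by picking the smallest number of components whose cumulative explained variance crosses a chosen threshold. A hedged sketch on synthetic data (the 95% threshold and the data are illustrative assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 8))
X[:, 1] += 2 * X[:, 0]   # correlated pair, so one component carries extra variance

X_scaled = StandardScaler().fit_transform(X)
cum = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches 95%
k = int(np.searchsorted(cum, 0.95) + 1)
print(k, np.round(cum, 3))
```

Applied to the scaled concrete features, the same rule with a ~95% threshold would select roughly the six components identified visually above.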

Supervised Model Selection and Evaluation in Machine Learning¶

In [39]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

Feature Selection and Target Variable Definition for Concrete Compressive Strength Prediction¶

In [40]:
# Separate features (X) and target variable (y)
X = df2.loc[:, df2.columns != 'Strength']  # Use 'Strength' instead of 'Concrete compressive strength'
y = df2['Strength']
In [41]:
# Display the first few rows of X with column names
print(X.head())
    Cement_content  Blast Furnace Slag  Fly Ash  Water  Superplasticizer  \
1            540.0                 0.0      0.0  162.0               2.5   
8            266.0               114.0      0.0  228.0               0.0   
11           198.6               132.4      0.0  192.0               0.0   
14           304.0                76.0      0.0  228.0               0.0   
21           139.6               209.4      0.0  192.0               0.0   

    Coarse Aggregate  Fine Aggregate  Age  
1             1055.0           676.0   28  
8              932.0           670.0   28  
11             978.4           825.5   28  
14             932.0           670.0   28  
21            1047.0           806.9   28  
In [42]:
# Display the first few rows of y with column names
print(y.head())
1     61.89
8     45.85
11    28.02
14    47.81
21    28.24
Name: Strength, dtype: float64

VIF Analysis¶

In [43]:
from statsmodels.stats.outliers_influence import variance_inflation_factor
import pandas as pd

# Step 1: Create a DataFrame to store VIF values
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns

# Step 2: Calculate VIF for each feature
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

# Step 3: Display the VIF values
print(vif_data)
              Feature         VIF
0      Cement_content   12.967337
1  Blast Furnace Slag    3.350984
2             Fly Ash    5.063123
3               Water  103.548685
4    Superplasticizer    6.130360
5    Coarse Aggregate   84.195047
6      Fine Aggregate   86.560715
7                 Age    3.479192
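For intuition about what `variance_inflation_factor` reports: the VIF of feature i is 1/(1 − R²), where R² comes from regressing that feature on all the others. A minimal sketch on a toy matrix (illustrative synthetic data, not the concrete dataset; note this version fits the auxiliary regression with an intercept, whereas the statsmodels call above runs on the raw X without an added constant):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
a = rng.normal(size=200)
# Columns 0 and 1 are strongly correlated; column 2 is independent noise
X_toy = np.column_stack([a, a + 0.1 * rng.normal(size=200), rng.normal(size=200)])

def vif(X, i):
    # Regress column i on the remaining columns; VIF_i = 1 / (1 - R_i^2)
    others = np.delete(X, i, axis=1)
    r2 = LinearRegression().fit(others, X[:, i]).score(others, X[:, i])
    return 1.0 / (1.0 - r2)

vifs = [vif(X_toy, i) for i in range(X_toy.shape[1])]
# The correlated pair shows heavily inflated VIFs; the independent column stays near 1
```

This is exactly the pattern in the outputs above: ingredient columns tied together by the mix proportions (Water, the aggregates) inflate each other's VIFs.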
In [44]:
# Step 1: Create the Water-to-Cement ratio feature
df2['Water_Cement_Ratio'] = df2['Water'] / df2['Cement_content']

# Step 2: Drop the raw 'Water' and 'Cement_content' columns (now captured by the
# ratio) along with 'Strength', which is the target variable
df2.drop(columns=['Water', 'Cement_content', 'Strength'], inplace=True)

# Step 3: Reassign X from the modified df2
X = df2.copy()

# Display the first few rows of the updated X to verify the changes
print(X.head())
    Blast Furnace Slag  Fly Ash  Superplasticizer  Coarse Aggregate  \
1                  0.0      0.0               2.5            1055.0   
8                114.0      0.0               0.0             932.0   
11               132.4      0.0               0.0             978.4   
14                76.0      0.0               0.0             932.0   
21               209.4      0.0               0.0            1047.0   

    Fine Aggregate  Age  Water_Cement_Ratio  
1            676.0   28            0.300000  
8            670.0   28            0.857143  
11           825.5   28            0.966767  
14           670.0   28            0.750000  
21           806.9   28            1.375358  
In [45]:
# Step 3: Recalculate VIF for the updated feature set
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

# Display the updated VIF values
print(vif_data)
              Feature        VIF
0  Blast Furnace Slag   3.494564
1             Fly Ash   5.029548
2    Superplasticizer   5.172757
3    Coarse Aggregate  61.338330
4      Fine Aggregate  73.803944
5                 Age   3.449458
6  Water_Cement_Ratio  14.641470
In [46]:
# Step 1: Create the Coarse-to-Fine Aggregate ratio feature
df2['Coarse_Fine_Ratio'] = df2['Coarse Aggregate'] / df2['Fine Aggregate']

# Step 2: Update X to include the new ratio and drop 'Fine Aggregate' and 'Coarse Aggregate' with inplace=True
X['Coarse_Fine_Ratio'] = df2['Coarse_Fine_Ratio']  # Add the new ratio feature
X.drop(columns=['Fine Aggregate', 'Coarse Aggregate'], inplace=True)  # Remove 'Fine Aggregate' and 'Coarse Aggregate'

# Step 3: Recalculate VIF for the updated feature set
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(len(X.columns))]

# Display the updated VIF values
print(vif_data)
              Feature        VIF
0  Blast Furnace Slag   3.073186
1             Fly Ash   4.519277
2    Superplasticizer   4.160602
3                 Age   3.418636
4  Water_Cement_Ratio  10.324470
5   Coarse_Fine_Ratio   8.267974
In [47]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import matplotlib.colors as mcolors
import numpy as np

# VIF values at each stage, copied from the outputs computed above

# Stage 1: initial feature set
vif_stage1 = pd.DataFrame({
    'Feature': ['Cement_content', 'Blast Furnace Slag', 'Fly Ash', 'Water',
                'Superplasticizer', 'Coarse Aggregate', 'Fine Aggregate', 'Age'],
    'VIF': [12.967337, 3.350984, 5.063123, 103.548685, 6.130360, 84.195047, 86.560715, 3.479192]
})

# Stage 2: after replacing Water and Cement_content with Water_Cement_Ratio
vif_stage2 = pd.DataFrame({
    'Feature': ['Blast Furnace Slag', 'Fly Ash', 'Superplasticizer',
                'Coarse Aggregate', 'Fine Aggregate', 'Age', 'Water_Cement_Ratio'],
    'VIF': [3.494564, 5.029548, 5.172757, 61.338330, 73.803944, 3.449458, 14.641470]
})

# Stage 3: after replacing the aggregates with Coarse_Fine_Ratio
vif_stage3 = pd.DataFrame({
    'Feature': ['Blast Furnace Slag', 'Fly Ash', 'Superplasticizer',
                'Age', 'Water_Cement_Ratio', 'Coarse_Fine_Ratio'],
    'VIF': [3.073186, 4.519277, 4.160602, 3.418636, 10.324470, 8.267974]
})

# Combine all features to ensure consistent color mapping
all_features = pd.concat([vif_stage1['Feature'], vif_stage2['Feature'], vif_stage3['Feature']]).unique()

# Assign a unique color to each feature using a colormap
cmap = plt.get_cmap('tab20')  # 'tab20' has 20 distinct colors
color_list = cmap.colors

# Create a color mapping dictionary
feature_colors = {}
for i, feature in enumerate(all_features):
    feature_colors[feature] = color_list[i % len(color_list)]  # Loop if more features than colors

# List of VIF DataFrames and their enhanced titles
vif_data = [
    (vif_stage1, 'Initial Feature Set'),
    (vif_stage2, 'Post Feature Engineering'),
    (vif_stage3, 'Final Feature Set')
]

# Labels for subplots
subplot_labels = ['(a)', '(b)', '(c)']

# Create subplots with high resolution
fig, axes = plt.subplots(1, 3, figsize=(24, 8), dpi=1200)  # Increased figure size for clarity

for idx, (ax, (vif_df, title)) in enumerate(zip(axes, vif_data)):
    # Sort the VIF values in ascending order for horizontal bars
    vif_df_sorted = vif_df.sort_values(by='VIF', ascending=True)
    
    # Assign colors based on the feature_colors dictionary
    colors = [feature_colors[feature] for feature in vif_df_sorted['Feature']]
    
    # Create a horizontal bar plot
    bars = ax.barh(vif_df_sorted['Feature'], vif_df_sorted['VIF'], color=colors, edgecolor='black')
    
    # Add data labels to each bar
    for bar in bars:
        width = bar.get_width()
        ax.text(width + 0.5, bar.get_y() + bar.get_height()/2,
                f'{width:.2f}', va='center', fontsize=14, fontweight='bold')
    
    # Set titles and labels with enhanced formatting
    ax.set_title(title, fontsize=20, fontweight='bold')  # Increased font size for subplot titles
    ax.set_xlabel('Variance Inflation Factor (VIF)', fontsize=18, fontweight='bold')  # Increased x-axis label size
    
    if idx == 0:
        ax.set_ylabel('Features', fontsize=18, fontweight='bold')  # Set y-axis label only for the first subplot
    else:
        ax.set_ylabel('')  # Remove y-axis label
    
    # Customize ticks: increase size and make them bold
    ax.tick_params(axis='both', which='major', labelsize=16, width=2)  # Increased labelsize to 16
    for label in ax.get_xticklabels() + ax.get_yticklabels():
        label.set_fontweight('bold')
        label.set_fontsize(16)  # Increase tick label font size to 16
    
    # Format x-axis to display integer values if applicable
    ax.xaxis.set_major_locator(ticker.MaxNLocator(integer=True))
    
    # Add grid lines for better readability
    ax.grid(axis='x', linestyle='--', alpha=0.7)
    
    # Add subplot labels (a), (b), (c)
    ax.text(-0.05, 1.05, subplot_labels[idx], transform=ax.transAxes,
            fontsize=24, fontweight='bold', va='top', ha='right')

# No color legend is needed: each feature keeps the same color across subplots

# Set an overall figure title with all subplot labels included
fig.suptitle(
    'VIF Results for Input Feature Selection: (a) All Initial Features; (b) Revised Features; (c) Final Feature Set',
    fontsize=28,
    fontweight='bold'
)

# Adjust layout to prevent overlap and ensure clarity
plt.tight_layout(rect=[0, 0.03, 1, 0.95])  # Adjust rect to accommodate the suptitle

# Save the figure with high DPI for publication
plt.savefig('VIF_Reduction_Feature_Engineering.png', dpi=1200, bbox_inches='tight')

# Display the plot
plt.show()
[Figure: VIF bar charts for the three feature-selection stages, saved as VIF_Reduction_Feature_Engineering.png]
In [48]:
# Display the first few rows of X with column names
print(X.head())
    Blast Furnace Slag  Fly Ash  Superplasticizer  Age  Water_Cement_Ratio  \
1                  0.0      0.0               2.5   28            0.300000   
8                114.0      0.0               0.0   28            0.857143   
11               132.4      0.0               0.0   28            0.966767   
14                76.0      0.0               0.0   28            0.750000   
21               209.4      0.0               0.0   28            1.375358   

    Coarse_Fine_Ratio  
1            1.560651  
8            1.391045  
11           1.185221  
14           1.391045  
21           1.297559  

Data Splitting and Normalization for Model Training¶

In [49]:
from sklearn.model_selection import train_test_split

# Step 1: Split the data into training and test sets (no scaling)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Optional: Check the shapes of the training and test sets
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
X_train shape: (620, 6)
X_test shape: (156, 6)

Linear Regression Model Training and Evaluation¶

In [50]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Step 1: Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)  # Use unscaled training data

# Step 2: Make predictions on the test set (using unscaled test data)
y_pred_linear_regression = model.predict(X_test)

# Step 3: Evaluate the model using Mean Squared Error (MSE)
mse_linear_regression = mean_squared_error(y_test, y_pred_linear_regression)
print(f"Mean Squared Error: {mse_linear_regression}")
Mean Squared Error: 59.57834098495303
In [51]:
# Coefficients of the model
coefficients = model.coef_
intercept = model.intercept_

# Display the coefficients and intercept
print(f"Intercept: {intercept}")
print(f"Coefficients: {coefficients}")

# Optionally, pair the feature names with the coefficients
coef_df = pd.DataFrame({'Feature': X.columns, 'Coefficient': coefficients})
print(coef_df)
Intercept: 38.28103912516336
Coefficients: [ 7.93762038e-02  3.08187619e-02  3.20551244e-01  5.62146909e-01
 -3.45789833e+01 -1.47008307e+00]
              Feature  Coefficient
0  Blast Furnace Slag     0.079376
1             Fly Ash     0.030819
2    Superplasticizer     0.320551
3                 Age     0.562147
4  Water_Cement_Ratio   -34.578983
5   Coarse_Fine_Ratio    -1.470083

Get Detailed Model Summary with statsmodels¶

In [52]:
import statsmodels.api as sm

# Add a constant term to the model (for the intercept) using unscaled data
X_train_with_const = sm.add_constant(X_train)

# Fit the model using the unscaled data
model_sm = sm.OLS(y_train, X_train_with_const).fit()

# Print out the model summary (which includes p-values, R-squared, etc.)
print(model_sm.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:               Strength   R-squared:                       0.747
Model:                            OLS   Adj. R-squared:                  0.745
Method:                 Least Squares   F-statistic:                     302.0
Date:                Mon, 04 Nov 2024   Prob (F-statistic):          2.41e-179
Time:                        14:53:18   Log-Likelihood:                -2160.5
No. Observations:                 620   AIC:                             4335.
Df Residuals:                     613   BIC:                             4366.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
const                 38.2810      2.994     12.785      0.000      32.401      44.161
Blast Furnace Slag     0.0794      0.005     15.570      0.000       0.069       0.089
Fly Ash                0.0308      0.008      3.951      0.000       0.015       0.046
Superplasticizer       0.3206      0.090      3.543      0.000       0.143       0.498
Age                    0.5621      0.022     25.606      0.000       0.519       0.605
Water_Cement_Ratio   -34.5790      1.388    -24.912      0.000     -37.305     -31.853
Coarse_Fine_Ratio     -1.4701      1.980     -0.743      0.458      -5.358       2.418
==============================================================================
Omnibus:                       55.892   Durbin-Watson:                   1.991
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               74.704
Skew:                           0.701   Prob(JB):                     6.00e-17
Kurtosis:                       3.962   Cond. No.                     1.35e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.35e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Evaluating Model Performance: R-squared for Linear Regression¶

In [53]:
from sklearn.metrics import r2_score

# Calculate R-squared for linear regression predictions
r_squared_linear_regression = r2_score(y_test, y_pred_linear_regression)

# Display the R-squared value
print(f"R-squared: {r_squared_linear_regression}")
R-squared: 0.7473443611968387

Linear Regression Model Report¶

1. Model Performance (scikit-learn)¶

The linear regression model performed with the following results from scikit-learn:

  • Mean Squared Error (MSE): The MSE was 59.58, which indicates the average squared difference between the predicted and actual values. While not perfect, this MSE suggests moderate prediction accuracy.

  • R-squared: The R-squared value of 0.747 tells us that approximately 74.7% of the variance in the compressive strength of concrete is explained by the independent variables in the model. This value suggests that the model has a decent fit and does a reasonable job of capturing the important factors.

2. Model Summary (statsmodels)¶

Next, I used statsmodels to further analyze the regression model, which provided additional insights into the coefficients, p-values, and statistical significance.

| Feature              | Coefficient | P>|t| (Significance)    |
|----------------------|-------------|-------------------------|
| Intercept (constant) | 38.2810     | 0.000                   |
| Blast Furnace Slag   | 0.0794      | 0.000                   |
| Fly Ash              | 0.0308      | 0.000                   |
| Superplasticizer     | 0.3206      | 0.000                   |
| Age                  | 0.5621      | 0.000                   |
| Water_Cement_Ratio   | -34.5790    | 0.000                   |
| Coarse_Fine_Ratio    | -1.4701     | 0.458 (not significant) |

Key Takeaways from the Coefficients:¶

  1. Intercept (38.28): This means that if all independent variables were zero, the predicted compressive strength of the concrete would be 38.28 MPa.

  2. Blast Furnace Slag (0.0794): For every 1-unit increase in Blast Furnace Slag, the strength increases by 0.0794 MPa. The p-value is very low, indicating this feature is highly significant.

  3. Fly Ash (0.0308): Similar to Blast Furnace Slag, Fly Ash has a positive but smaller impact on strength. A 1-unit increase results in a 0.0308 MPa increase, and the significance is strong.

  4. Superplasticizer (0.3206): The Superplasticizer has a notable positive effect, contributing 0.3206 MPa for every unit increase. This is also highly significant.

  5. Age (0.5621): As expected, Age (days of curing) is a major factor. For each additional day of curing, the concrete gains about 0.5621 MPa in strength.

  6. Water-to-Cement Ratio (-34.5790): As expected, a higher Water-to-Cement Ratio negatively impacts the strength significantly. For each 1-unit increase in this ratio, the compressive strength decreases by 34.58 MPa.

  7. Coarse-to-Fine Aggregate Ratio (-1.4701): Interestingly, the Coarse-to-Fine Aggregate Ratio had a small negative coefficient, and the p-value (0.458) shows that it’s not statistically significant. This feature could potentially be excluded from future models.

3. Model Diagnostics¶

Some additional diagnostic results from statsmodels:

  • R-squared (0.747): The statsmodels R-squared, computed on the training set, essentially matches the scikit-learn test-set value: about 74.7% of the variance in compressive strength is explained by the model.

  • Adj. R-squared (0.745): This adjusted R-squared is very close to the regular R-squared, meaning that the included variables are not padding the model with uninformative features.

  • F-statistic (302.0): This high value, combined with the very low p-value (2.41e-179), indicates that the model as a whole is highly significant.

  • Durbin-Watson (1.991): This value, close to 2, suggests no significant autocorrelation in the residuals, which is a good sign.

  • Condition Number (1.35e+03): The high condition number indicates possible multicollinearity, but it doesn’t seem to be a major issue based on this model’s overall performance.

4. Recommendations¶

  1. Remove Coarse-to-Fine Aggregate Ratio: Since the p-value for this feature is high (0.458), it may not contribute significantly to the model. Excluding this feature could simplify the model without sacrificing performance.

  2. Optimize the Water-to-Cement Ratio: The Water-to-Cement Ratio has the largest negative impact on strength. This finding aligns with general concrete design principles—minimizing water content relative to cement is crucial for stronger concrete.

  3. Further Analysis of Features: Although the model appears strong, further exploration into feature engineering and multicollinearity (as suggested by the high condition number) may be beneficial for future work.

5. Conclusion¶

This linear regression model has provided valuable insights into the factors affecting concrete compressive strength. With an R-squared of around 74.7%, the model captures the majority of the variability in the data. Notably, the Water-to-Cement Ratio, Age, and Superplasticizer are critical factors that influence strength. Removing the non-significant Coarse-to-Fine Aggregate Ratio could further simplify the model while maintaining or improving performance.

These results can guide future concrete mix designs by focusing on reducing the Water-to-Cement Ratio and optimizing the use of Superplasticizers and other additives.

Cross-Validation Performance Analysis: Mean Squared Error (MSE) Scores Visualization¶

In [54]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error
import numpy as np
import matplotlib.pyplot as plt

# Perform cross-validation to get negative MSE scores and negate them to get positive values
mse_scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_squared_error')

# Plot the MSE scores for each cross-validation fold
plt.figure(figsize=(8, 6))
plt.plot(np.arange(1, 6), mse_scores, marker='o', linestyle='-', color='b')
plt.title('Mean Squared Error (MSE) Across Cross-Validation Folds')
plt.xlabel('Cross-Validation Fold')
plt.ylabel('MSE Score')
plt.grid(True)
plt.show()
[Figure: MSE across the five cross-validation folds]
  • The MSE scores vary across the different folds, indicating variability in the model's performance depending on the subset of data used.

  • The highest MSE score is observed in the fourth fold, which suggests that the model performed the worst on this particular subset of data.

  • The lowest MSE score occurs in the fifth fold, indicating the best model performance on this subset.

  • The variation in MSE scores across the folds might suggest the data contains heterogeneous subsets or that the model has varying degrees of fit to different parts of the data.
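Given this fold-to-fold variability, summarizing the cross-validation results as a mean and standard deviation gives a more stable picture than any single fold. A minimal sketch on synthetic stand-in data (`make_regression` substitutes for the concrete features and target here):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the concrete features/target
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)

# Negate the scores: sklearn reports negative MSE so that higher is better
mse_scores = -cross_val_score(LinearRegression(), X_demo, y_demo,
                              cv=5, scoring='neg_mean_squared_error')

print(f"MSE: {mse_scores.mean():.2f} +/- {mse_scores.std():.2f}")
```

A large standard deviation relative to the mean is itself a diagnostic: it flags the heterogeneous-subset behavior noted above.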

K-Nearest Neighbors Regression: Model Training, Prediction, and MSE Evaluation¶

In [55]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Initialize the MinMaxScaler and scale the features
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
In [56]:
from sklearn.metrics import mean_squared_error

# Initialize the K-Nearest Neighbors Regressor with k=4
knn_reg = KNeighborsRegressor(n_neighbors=4)

# Train the KNN regression model on scaled data
knn_reg.fit(X_train_scaled, y_train)

# Make predictions on the test set using the trained model
y_pred_knn = knn_reg.predict(X_test_scaled)

# Evaluate the model using Mean Squared Error (MSE)
mse_knn = mean_squared_error(y_test, y_pred_knn)
print(f"Mean Squared Error: {mse_knn}")
Mean Squared Error: 58.27855809294873

K-Nearest Neighbors Regression (k=4) vs. Linear Regression¶

After setting k=4 for the KNN regression model, the following results were observed:

  • KNN Regression (k=4) MSE: 58.28
  • Linear Regression MSE: 59.58

Insights:¶

  • The KNN model with k=4 produced a slightly lower Mean Squared Error (MSE) than the linear regression model (58.28 vs. 59.58).
  • The improvement is modest: KNN, being a non-parametric model, can capture non-linear relationships that linear regression misses, but the gain on this feature set is small.

Conclusion:¶

The KNN model with k=4 edges out the linear regression model, suggesting some benefit from its flexibility in capturing non-linear patterns, though the margin on this dataset is small.
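The choice of k=4 above was fixed in advance; a small sweep over candidate k values (on scaled features, since KNN is distance-based) shows how sensitive the test error is to that choice. A sketch on synthetic stand-in data (the notebook uses the engineered concrete features):

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=5.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Fit the scaler on training data only, then apply it to both splits
scaler = MinMaxScaler()
X_tr_s = scaler.fit_transform(X_tr)
X_te_s = scaler.transform(X_te)

# Test MSE for each candidate k
results = {}
for k in range(1, 11):
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_tr_s, y_tr)
    results[k] = mean_squared_error(y_te, knn.predict(X_te_s))

best_k = min(results, key=results.get)
```

For a less optimistic estimate, the sweep could use cross-validation on the training set (e.g. `GridSearchCV`) rather than scoring against the held-out test set directly.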

K-Nearest Neighbors Regression: Determination of Coefficient of Determination (R²)¶

In [57]:
from sklearn.metrics import r2_score

# R-squared for KNN Regression
r_squared_knn = r2_score(y_test, y_pred_knn)

# Display the R-squared value
print(f"R-squared: {r_squared_knn}")
R-squared: 0.7528563890824038

K-Nearest Neighbors Regression (k=4) - R-squared¶

  • R-squared (KNN, k=4): 0.753
    • This means that 75.3% of the variance in the concrete compressive strength is explained by the KNN regression model with k=4.

Comparison with Linear Regression¶

  • R-squared (Linear Regression): 0.747
    • In comparison, the linear regression model explains 74.7% of the variance.

Insights¶

  • The KNN model (k=4) explains a slightly larger share of the variance than the linear regression model.
  • This hints at some non-linear relationships between the features and the target, although the two models perform comparably on this feature set.
  • On these metrics, KNN is a marginally stronger candidate for predicting concrete compressive strength.

Implementation of Decision Tree Regression Model¶

In [58]:
from sklearn.tree import DecisionTreeRegressor

Training and Prediction with Decision Tree Regressor¶

In [59]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Step 1: Initialize the Decision Tree Regressor
tree_reg = DecisionTreeRegressor(random_state=42)

# Step 2: Train the model on the training data
tree_reg.fit(X_train, y_train)

# Step 3: Make predictions on the test set
y_pred_tree = tree_reg.predict(X_test)

# Step 4: Evaluate the model using Mean Squared Error (MSE)
mse_tree = mean_squared_error(y_test, y_pred_tree)
print(f"Mean Squared Error (Decision Tree): {mse_tree}")
Mean Squared Error (Decision Tree): 38.58577131410257
In [60]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

# Define the hyperparameters grid
param_grid = {
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 10, 20],
    'min_samples_leaf': [1, 5, 10],
}

# Initialize the Decision Tree Regressor
tree_reg = DecisionTreeRegressor(random_state=42)

# Use GridSearchCV to search for the best hyperparameters
grid_search = GridSearchCV(tree_reg, param_grid, cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
grid_search.fit(X_train, y_train)

# Get the best model from the grid search
best_tree = grid_search.best_estimator_

# Make predictions using the best model
y_pred_best_tree = best_tree.predict(X_test)

# Evaluate the best model using Mean Squared Error (MSE)
mse_best_tree = mean_squared_error(y_test, y_pred_best_tree)
print(f"Best Mean Squared Error (Tuned Decision Tree): {mse_best_tree}")
print(f"Best Parameters: {grid_search.best_params_}")
Best Mean Squared Error (Tuned Decision Tree): 38.58577131410257
Best Parameters: {'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}

Decision Tree Regression - Hyperparameter Tuning Results¶

After tuning the Decision Tree Regressor, the following results were obtained:

  • Best Mean Squared Error (MSE): 38.59

    • This is identical to the MSE of the untuned model: the grid search selected an unconstrained tree.
  • Best Hyperparameters:

    • max_depth: None
    • min_samples_leaf: 1
    • min_samples_split: 2

Insights¶

  • No improvement: The selected hyperparameters are the estimator's defaults (a fully grown tree), so tuning within this grid left the MSE unchanged at 38.59.
  • Best Parameters: Allowing the tree to grow to full depth with the finest possible splits performed at least as well as any of the constrained alternatives in the grid; a wider grid, or pruning via ccp_alpha, might still help.

Comparison with Other Models:¶

  • The Decision Tree model (MSE 38.59) outperforms both the K-Nearest Neighbors model (k=4, MSE 58.28) and linear regression (MSE 59.58) on this test set.
  • This suggests the tree captures interactions and non-linearities the other models miss, and that ensemble methods (e.g., Random Forest) could yield further gains.

Conclusion:¶

Since the grid search over depth and split parameters did not improve on the default Decision Tree, further gains are more likely to come from ensemble methods, cost-complexity pruning, or additional feature selection and engineering than from re-running the same grid.
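One pruning-based alternative worth noting: scikit-learn's cost-complexity pruning trades tree size against training fit via the `ccp_alpha` parameter, with candidate alphas supplied by `cost_complexity_pruning_path`. A sketch on synthetic stand-in data (the `[::10]` subsampling is only to keep the search cheap):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Candidate alphas from the pruning path of the fully grown tree
path = DecisionTreeRegressor(random_state=42).cost_complexity_pruning_path(X_tr, y_tr)
alphas = path.ccp_alphas[::10]

# Pick the alpha with the best cross-validated score on the training set
scores = [cross_val_score(DecisionTreeRegressor(ccp_alpha=a, random_state=42),
                          X_tr, y_tr, cv=5).mean() for a in alphas]
best_alpha = float(alphas[int(np.argmax(scores))])
```

Unlike a depth/split grid, this searches over the tree's own natural sequence of pruned subtrees, so it can find intermediate tree sizes the grid never considers.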

Actual vs. Predicted Values Comparison: Decision Tree Regression Model¶

In [61]:
import matplotlib.pyplot as plt

# Plot the predicted vs. actual values
plt.scatter(y_test, y_pred_tree, color='blue', label='Predicted vs Actual')

# Plot the ideal line (where predictions match the actual values)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', label='Ideal Line')

# Add labels, title, and legend
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values (Decision Tree Regression)')
plt.legend()

# Display the plot
plt.show()
[Figure: actual vs. predicted strength for the Decision Tree, with the ideal y = x line]

The points are reasonably well aligned along the ideal line, indicating that the model’s predictions are close to the actual values, with some variance. The model performs well, especially for higher values of concrete strength.

RandomForestRegressor for Regression Task¶

In [62]:
from sklearn.ensemble import RandomForestRegressor

Training and Prediction with Random Forest Regressor¶

In [63]:
# Step 1: Initialize Random Forest Regressor
rf_reg = RandomForestRegressor(n_estimators=100, random_state=42)  # Adjust the number of trees if needed

# Step 2: Train the model
rf_reg.fit(X_train, y_train)

# Step 3: Make predictions on the test set
y_pred_rf = rf_reg.predict(X_test)  # Use a consistent variable name

Evaluating Random Forest Regressor: Mean Squared Error Calculation¶

In [64]:
# Step 1: Evaluate Mean Squared Error (MSE)
mse_rf_regression = mean_squared_error(y_test, y_pred_rf)  # Use consistent variable name
print(f"Mean Squared Error (Random Forest): {mse_rf_regression}")

# Step 2: Evaluate R-squared value
r2_rf_regression = r2_score(y_test, y_pred_rf)  # Use consistent variable name
print(f"R-squared (Random Forest): {r2_rf_regression}")
Mean Squared Error (Random Forest): 25.589849664712826
R-squared (Random Forest): 0.8914803650617298
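Beyond the aggregate MSE/R² numbers, a fitted random forest also exposes impurity-based feature importances, which indicate how much each feature contributed to the splits. A sketch on synthetic stand-in data (the notebook's X has six engineered features):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data with six features, like the engineered concrete set
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=42)

rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances: one value per feature, normalized to sum to 1
importances = rf.feature_importances_
ranking = np.argsort(importances)[::-1]  # feature indices, most important first
```

With the actual dataset, pairing `importances` with `X.columns` would show which engineered features (e.g., Water_Cement_Ratio or Age) the forest leans on most. Impurity-based importances can be biased toward high-cardinality features, so permutation importance is a common cross-check.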
In [65]:
import optuna
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 100, 500)
    max_depth = trial.suggest_categorical('max_depth', [None, 10, 20, 30, 40])
    min_samples_split = trial.suggest_int('min_samples_split', 2, 10)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 5)
    
    # Updated max_features to exclude 'auto'
    max_features = trial.suggest_categorical('max_features', ['sqrt', 'log2', None])

    rf_reg = RandomForestRegressor(n_estimators=n_estimators,
                                    max_depth=max_depth,
                                    min_samples_split=min_samples_split,
                                    min_samples_leaf=min_samples_leaf,
                                    max_features=max_features,
                                    random_state=42)

    rf_reg.fit(X_train, y_train)
    y_pred = rf_reg.predict(X_test)
    return mean_squared_error(y_test, y_pred)

# Create a study object and optimize
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=100)

# Get the best parameters
print(f"Best Mean Squared Error: {study.best_value}")
print(f"Best Parameters: {study.best_params}")
[I 2024-11-04 14:53:24,730] A new study created in memory with name: no-name-9597c9df-532a-4442-ba27-3a33813755b7
[I 2024-11-04 14:53:25,401] Trial 0 finished with value: 28.152311662836674 and parameters: {'n_estimators': 385, 'max_depth': 10, 'min_samples_split': 6, 'min_samples_leaf': 2, 'max_features': None}. Best is trial 0 with value: 28.152311662836674.
[I 2024-11-04 14:53:25,936] Trial 1 finished with value: 37.45392635484747 and parameters: {'n_estimators': 500, 'max_depth': 40, 'min_samples_split': 6, 'min_samples_leaf': 3, 'max_features': 'sqrt'}. Best is trial 0 with value: 28.152311662836674.
... [trials 2–74 omitted] ...
[I 2024-11-04 14:53:55,717] Trial 75 finished with value: 24.269797978684707 and parameters: {'n_estimators': 277, 'max_depth': 30, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': None}. Best is trial 75 with value: 24.269797978684707.
... [trials 76–98 omitted; none improved on trial 75] ...
[I 2024-11-04 14:54:11,097] Trial 99 finished with value: 24.324616404212886 and parameters: {'n_estimators': 373, 'max_depth': 30, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': None}. Best is trial 75 with value: 24.269797978684707.
Best Mean Squared Error: 24.269797978684707
Best Parameters: {'n_estimators': 277, 'max_depth': 30, 'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': None}
In [66]:
# Alternatively, fit the best hyperparameters to the training data
best_rf_model = RandomForestRegressor(
    n_estimators=study.best_params['n_estimators'],
    max_depth=study.best_params['max_depth'],
    min_samples_split=study.best_params['min_samples_split'],
    min_samples_leaf=study.best_params['min_samples_leaf'],
    max_features=study.best_params['max_features'],
    random_state=42
)

# Fit the model on the training data
best_rf_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_best_rf = best_rf_model.predict(X_test)

# Calculate R-squared
r_squared_best_rf = r2_score(y_test, y_pred_best_rf)
print(f"R-squared for the best Random Forest model: {r_squared_best_rf}")
R-squared for the best Random Forest model: 0.8970783474236566
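Since `study.best_params` is a plain dictionary, the explicit parameter list in the cell above can also be written with dict unpacking. A minimal sketch; the `best_params` literal below simply mirrors the values the study reported:

```python
from sklearn.ensemble import RandomForestRegressor

# Hypothetical dict mirroring study.best_params from the run above
best_params = {"n_estimators": 277, "max_depth": 30,
               "min_samples_split": 3, "min_samples_leaf": 1,
               "max_features": None}

# Unpacking keeps the constructor call in sync with whatever the study found
model = RandomForestRegressor(**best_params, random_state=42)
```

This avoids copying each parameter by hand, which is where stale values (e.g. from an earlier run) tend to creep in.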

Random Forest Regression Results¶

  • Mean Squared Error (MSE): 24.27

    • The average squared difference between the predicted and actual compressive strengths is 24.27 MPa², a clear improvement in prediction accuracy over the previous models.
  • R-squared: 0.897

    • An R-squared of 0.897 means that 89.7% of the variance in concrete compressive strength is explained by the Random Forest model. This suggests a strong fit, indicating that the model effectively captures the relationships between the features and the target variable.

Best Hyperparameters¶

  • n_estimators: 277
  • max_depth: 30
  • min_samples_split: 3
  • min_samples_leaf: 1
  • max_features: None

Insights¶

  • The Random Forest model demonstrates superior performance in terms of both MSE and R-squared compared to the linear regression and KNN models, which had higher MSE values.
  • With an MSE of 24.27, the Random Forest model provides more accurate predictions, making it a robust choice for this regression task.
  • The high R-squared value indicates that the model is capable of explaining a significant portion of the variance in the target variable, reflecting its effectiveness in modeling complex relationships.

Conclusion¶

Overall, the Random Forest Regressor outperforms the previous models, demonstrating its suitability for predicting concrete compressive strength. This suggests that Random Forest may be a good choice in practical applications where accurate strength predictions are needed.
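One caveat: the Optuna objective above minimized MSE on the same test split used for the final report, so the held-out error may be slightly optimistic. K-fold cross-validation gives a more stable estimate. A sketch with synthetic data standing in for the notebook's `X`/`y`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's feature matrix X and target y
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
y = X @ rng.normal(size=6) + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, max_depth=30, random_state=42)
# sklearn maximizes scores, so MSE is negated; flip the sign back
scores = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print(f"CV MSE: {scores.mean():.2f} +/- {scores.std():.2f}")
```

Reporting the mean and spread over folds shows how sensitive the error estimate is to the particular split.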

Comparison of Actual vs. Predicted Values: Random Forest Regression Model¶

In [67]:
# Make predictions on the test set
y_pred_best_rf = best_rf_model.predict(X_test)

# Plot the predicted vs. actual values
plt.scatter(y_test, y_pred_best_rf, color='blue', label='Predicted vs Actual')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', label='Ideal Line')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values (Best Random Forest Regression)')
plt.legend()
plt.show()

The model appears to predict with a high degree of accuracy across the range of values, with slightly more scatter at the lower end. This indicates a well-fitting model.
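R-squared is unitless, so reporting RMSE and MAE in MPa alongside it makes the error easier to read against typical strength values. A minimal sketch with hypothetical values standing in for `y_test` and `y_pred_best_rf`:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Hypothetical stand-ins for y_test and y_pred_best_rf (values in MPa)
y_true = np.array([30.0, 45.5, 22.1, 60.3])
y_hat = np.array([28.5, 47.0, 25.0, 58.0])

rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # same units as the target
mae = mean_absolute_error(y_true, y_hat)           # less sensitive to outliers
print(f"RMSE: {rmse:.2f} MPa, MAE: {mae:.2f} MPa")
```

An RMSE noticeably larger than the MAE hints that a few large errors dominate, which the scatter at the lower end of the plot would also show.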

Residual Analysis for Random Forest Regression Model¶

In [68]:
# Calculate residuals for the best model
residuals = y_test - y_pred_best_rf  # Use predictions from the best model

# Plot the residuals
plt.scatter(y_pred_best_rf, residuals, color='blue', label='Residuals')
plt.axhline(y=0, color='red', linestyle='--', label='Zero Residuals Line')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot (Best Random Forest Regression)')
plt.legend()
plt.show()

The residual plot for the Random Forest Regression model shows residuals (differences between actual and predicted values) distributed randomly around the zero residual line, without any apparent pattern. This suggests that the model's predictions are unbiased and that the variance of the residuals is fairly constant across the range of predictions. There are a few outliers with larger residuals, but no trends indicating systematic error, which is indicative of a well-performing model.
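The visual impression from the residual plot can be quantified: unbiased predictions imply residuals averaging near zero and showing little correlation with the predicted values. A sketch with synthetic residuals standing in for `y_test - y_pred_best_rf`:

```python
import numpy as np

# Synthetic stand-ins for predictions and residuals
rng = np.random.default_rng(42)
preds = rng.uniform(10, 80, size=200)
residuals = rng.normal(0.0, 5.0, size=200)  # unbiased, constant variance

mean_resid = residuals.mean()               # near zero for unbiased predictions
corr = np.corrcoef(preds, residuals)[0, 1]  # near zero when no trend remains
print(f"mean residual: {mean_resid:.3f}, corr with predictions: {corr:.3f}")
```

A mean residual far from zero would indicate systematic over- or under-prediction, and a strong correlation would indicate a trend the model failed to capture.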

Feature Importance Visualization for Random Forest Model¶

In [69]:
# Get feature importances from the best Random Forest model
importances = best_rf_model.feature_importances_

# Create a DataFrame for better visualization
feature_names = X_train.columns  # use the training features the model was fit on
importance_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})

# Sort the DataFrame by importance
importance_df = importance_df.sort_values(by='Importance', ascending=False)

# Plotting the feature importances
plt.figure(figsize=(10, 6))
plt.barh(importance_df['Feature'], importance_df['Importance'], color='skyblue')
plt.xlabel('Importance')
plt.title('Feature Importance for Best Random Forest Model')
plt.gca().invert_yaxis()  # To display the most important feature at the top
plt.show()

Feature Importance for Best Random Forest Model¶

The bar chart above illustrates the feature importance values derived from the best Random Forest model.

  • X-axis (Importance): The values on the X-axis represent the relative importance of each feature in predicting the target variable, which is the concrete compressive strength in this case. These values are derived from the Random Forest algorithm, where each feature's importance is calculated based on how much it contributes to reducing the error in the model's predictions. The higher the value, the more significant the feature is in determining the outcome.

Key Features:¶

  1. Age: The most important feature, with the highest importance value, indicates that the age of the concrete significantly influences its compressive strength.
  2. Water_Cement_Ratio: This feature also has a substantial impact, suggesting that the ratio of water to cement in the mix is critical for predicting strength.
  3. Superplasticizer: The presence of superplasticizer in the mix contributes to the strength of the concrete, as shown by its importance value.
  4. Blast Furnace Slag: This feature has moderate importance, indicating its role in enhancing the concrete's properties.
  5. Coarse_Fine_Ratio: The ratio of coarse to fine aggregate appears to have less influence compared to the others.
  6. Fly Ash: This feature shows the least importance among those listed, suggesting it has a minor role in predicting compressive strength when using this model.

Conclusion¶

Understanding feature importance helps in interpreting the model's predictions and can guide future decisions regarding mix design or further feature engineering. The results highlight critical factors that influence concrete strength, informing practical applications in construction and material science.
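Impurity-based importances from Random Forests can be biased toward features with many possible split points; permutation importance on held-out data is a common cross-check. A sketch on synthetic data where one feature dominates by construction (in the notebook, `best_rf_model` with `X_test`/`y_test` would be used instead):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data where feature 0 dominates the target by construction
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)
# Shuffle each feature in turn and measure the drop in held-out score
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)
order = np.argsort(result.importances_mean)[::-1]  # most important first
```

If the permutation ranking broadly agrees with the impurity-based ranking, as it should for Age and Water_Cement_Ratio here, the importance conclusions are more trustworthy.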

In [70]:
print(df2.columns)
Index(['Blast Furnace Slag', 'Fly Ash', 'Superplasticizer', 'Coarse Aggregate',
       'Fine Aggregate', 'Age', 'Water_Cement_Ratio', 'Coarse_Fine_Ratio'],
      dtype='object')
In [71]:
# Print the column names of the original DataFrame before scaling
print(X_test.columns)

# Print the contents of the scaled array
print(X_test_scaled)
Index(['Blast Furnace Slag', 'Fly Ash', 'Superplasticizer', 'Age',
       'Water_Cement_Ratio', 'Coarse_Fine_Ratio'],
      dtype='object')
[[ 0.00000000e+00  0.00000000e+00  0.00000000e+00  4.90909091e-01
   2.17595316e-01  2.96941706e-01]
 [ 4.95761473e-01  0.00000000e+00  3.68181818e-01  4.90909091e-01
   1.93286104e-01  9.87721478e-02]
 [ 7.60011692e-02  6.14692654e-01  1.77272727e-01  3.63636364e-02
   9.21760559e-02  3.73120782e-01]
 [ 3.53697749e-01  0.00000000e+00  3.18181818e-01  4.90909091e-01
   2.09141122e-01  1.90241400e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.09090909e-01
   7.48898678e-02  4.30749826e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  3.63636364e-02
   1.87262378e-01  2.89656956e-01]
 [ 4.17421806e-01  5.55722139e-01  1.00454545e+00  4.90909091e-01
   1.63175654e-01  5.27925323e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.00000000e+00
   1.81448589e-01  4.32629916e-01]
 [ 0.00000000e+00  5.01749125e-01  3.95454545e-01  2.36363636e-01
   2.93098385e-01  2.12981910e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.09090909e-01
   2.16096353e-01  5.42666762e-01]
 [ 0.00000000e+00  5.77711144e-01  4.72727273e-01  4.90909091e-01
   2.04897315e-01  9.29656152e-02]
 [ 0.00000000e+00  5.02248876e-01  3.40909091e-01  4.90909091e-01
   3.56063434e-01  2.08196839e-01]
 [ 0.00000000e+00  4.99250375e-01  5.63636364e-01  2.36363636e-01
   1.87363465e-01  2.15944177e-01]
 [ 0.00000000e+00  5.90704648e-01  2.77272727e-01  4.90909091e-01
   3.55604610e-01  4.77243909e-01]
 [ 7.60011692e-01  0.00000000e+00  5.90909091e-01  4.90909091e-01
   5.13251243e-01  1.29218101e-01]
 [ 4.38468284e-02  9.74512744e-01  2.72727273e-01  4.90909091e-01
   5.88024147e-01  5.67988257e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  3.63636364e-02
   1.68591613e-01  4.20688880e-01]
 [ 5.52470038e-01  0.00000000e+00  4.31818182e-01  1.00000000e+00
   1.48287405e-01  3.60145861e-01]
 [ 4.38468284e-02  7.04647676e-01  2.50000000e-01  1.09090909e-01
   2.18604579e-01  1.42874319e-01]
 [ 0.00000000e+00  4.83258371e-01  2.04545455e-01  3.63636364e-02
   3.56456465e-01  2.08510169e-01]
 [ 0.00000000e+00  6.07696152e-01  2.59090909e-01  1.00000000e+00
   3.55771008e-01  4.77316919e-01]
 [ 4.87576732e-01  0.00000000e+00  0.00000000e+00  3.63636364e-02
   3.32853834e-01  5.33427952e-01]
 [ 0.00000000e+00  4.78260870e-01  2.50000000e-01  1.00000000e+00
   2.93069016e-01  2.08398661e-01]
 [ 6.92779889e-01  0.00000000e+00  5.45454545e-01  4.90909091e-01
   5.49192364e-01  7.25023773e-01]
 [ 5.87255189e-01  0.00000000e+00  5.09090909e-01  3.63636364e-02
   1.40676811e-01  3.60184177e-01]
 [ 4.38468284e-02  7.04647676e-01  2.50000000e-01  4.90909091e-01
   2.18604579e-01  1.42874319e-01]
 [ 0.00000000e+00  5.02248876e-01  3.40909091e-01  1.00000000e+00
   3.56063434e-01  2.08196839e-01]
 [ 0.00000000e+00  4.80759620e-01  4.27272727e-01  3.63636364e-02
   1.86511187e-01  2.08496896e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.09090909e-01
   1.78583531e-01  4.77845612e-01]
 [ 2.89389068e-01  3.84807596e-01  2.72727273e-01  4.90909091e-01
   2.87812041e-01  3.35040964e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  3.63636364e-02
   3.38273725e-01  1.89443956e-01]
 [ 4.34375913e-01  0.00000000e+00  3.86363636e-01  4.90909091e-01
   1.84805746e-01  4.59022389e-01]
 [ 0.00000000e+00  4.78260870e-01  2.40909091e-01  1.00000000e+00
   3.04052863e-01  2.03864265e-01]
 [ 5.84624379e-02  4.69765117e-01  6.31818182e-01  4.90909091e-01
   7.84376221e-02  2.07234574e-01]
 [ 7.60011692e-02  6.14692654e-01  1.77272727e-01  1.00000000e+00
   9.21760559e-02  3.73120782e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  4.90909091e-01
   3.11993936e-01  5.43284175e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.09090909e-01
   2.17416756e-01  5.43247793e-01]
 [ 0.00000000e+00  0.00000000e+00  3.77272727e-01  4.90909091e-01
   2.11897688e-02  8.55545617e-02]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  3.63636364e-02
   2.51944308e-02  1.00000000e+00]
 [ 0.00000000e+00  7.51624188e-01  5.31818182e-01  4.90909091e-01
   5.13794845e-01  5.28457921e-01]
 [ 0.00000000e+00  4.99750125e-01  4.54545455e-01  4.90909091e-01
   2.26306541e-01  4.80780542e-01]
 [ 3.39082140e-01  4.51274363e-01  4.04545455e-01  4.90909091e-01
   2.30774952e-01  2.31767291e-01]
 [ 0.00000000e+00  0.00000000e+00  3.63636364e-01  4.90909091e-01
   1.80706433e-01  3.22922732e-01]
 [ 8.48289974e-01  0.00000000e+00  0.00000000e+00  1.09090909e-01
   4.24182404e-01  5.43104080e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.09090909e-01
  -1.91289573e-04  8.69176566e-02]
 [ 0.00000000e+00  8.74562719e-01  8.18181818e-01  4.90909091e-01
   5.10013868e-01  2.15148709e-01]
 [ 0.00000000e+00  8.16091954e-01  2.04545455e-01  4.90909091e-01
   4.88414795e-01  4.77196833e-01]
 [ 7.01549255e-02  3.94802599e-01  5.27272727e-01  3.63636364e-02
   5.19415526e-02  4.78443691e-01]
 [ 6.43086817e-02  6.59670165e-01  3.86363636e-01  1.00000000e+00
   8.72537611e-02  1.91865058e-01]
 [ 0.00000000e+00  0.00000000e+00  1.36363636e-01  4.90909091e-01
   7.48399512e-02  9.98275965e-01]
 [ 3.79713534e-01  6.42678661e-01  3.54545455e-01  1.00000000e+00
   4.79536734e-01  4.67180965e-01]
 [ 0.00000000e+00  9.74012994e-01  5.00000000e-01  4.90909091e-01
   6.90855879e-01  3.70300668e-01]
 [ 1.57263958e-01  6.09195402e-01  4.36363636e-01  2.36363636e-01
   2.77128521e-01  4.14206065e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  4.90909091e-01
   2.11785325e-01  3.26631853e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.09090909e-01
   1.87262378e-01  2.89656956e-01]
 [ 5.55393160e-02  7.04647676e-01  4.95454545e-01  1.00000000e+00
   9.29691303e-02  2.79353982e-01]
 [ 5.61239404e-01  0.00000000e+00  0.00000000e+00  1.09090909e-01
   2.41311796e-01  4.13142079e-01]
 [ 7.60011692e-02  6.14692654e-01  1.77272727e-01  1.09090909e-01
   9.21760559e-02  3.73120782e-01]
 [ 3.10727857e-01  0.00000000e+00  8.45454545e-01  1.09090909e-01
   4.75770925e-02  2.66873658e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.09090909e-01
   1.28242780e-01  5.43096270e-01]
 [ 6.21163403e-01  0.00000000e+00  6.50000000e-01  1.09090909e-01
   1.30054979e-01  5.17772438e-02]
 [ 3.47266881e-01  0.00000000e+00  4.04545455e-01  1.09090909e-01
   6.31965376e-02  1.85676541e-01]
 [ 5.52470038e-01  0.00000000e+00  1.00000000e+00  1.09090909e-01
   5.87876179e-02  3.60145861e-01]
 [ 0.00000000e+00  4.93753123e-01  5.81818182e-01  1.00000000e+00
   2.21352523e-01  2.08411884e-01]
 [ 3.34989769e-01  4.46276862e-01  4.00000000e-01  4.90909091e-01
   6.85340270e-01  1.27113756e-01]
 [ 8.92429114e-01  0.00000000e+00  0.00000000e+00  4.90909091e-01
   4.49339207e-01  6.65447150e-01]
 [ 8.48289974e-01  0.00000000e+00  0.00000000e+00  4.90909091e-01
   4.24182404e-01  5.43104080e-01]
 [ 3.79713534e-01  6.42678661e-01  3.54545455e-01  2.36363636e-01
   4.79536734e-01  4.67180965e-01]
 [ 4.23852675e-01  0.00000000e+00  3.63636364e-01  4.90909091e-01
   5.51313428e-01  3.23645119e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.09090909e-01
   1.34801762e-01  4.90739014e-01]
 [ 2.98158433e-01  0.00000000e+00  0.00000000e+00  3.63636364e-02
   6.08418992e-01  2.30480396e-02]
 [ 6.96287635e-01  0.00000000e+00  0.00000000e+00  4.90909091e-01
   5.55055908e-01  5.42927195e-01]
 [ 1.57263958e-01  6.09195402e-01  4.36363636e-01  1.00000000e+00
   2.77128521e-01  4.14206065e-01]
 [ 5.84624379e-02  4.69765117e-01  6.50000000e-01  1.00000000e+00
   7.84376221e-02  2.07234574e-01]
 [ 4.09237065e-01  0.00000000e+00  2.72727273e-01  4.90909091e-01
   2.64578020e-01  3.48862205e-01]
 [ 0.00000000e+00  5.91204398e-01  2.63636364e-01  4.90909091e-01
   2.93194528e-01  4.77480952e-01]
 [ 5.05700088e-01  0.00000000e+00  0.00000000e+00  3.63636364e-02
   8.58220669e-01  1.08857692e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  4.90909091e-01
   2.08027411e-01  4.71944812e-01]
 [ 0.00000000e+00  5.64717641e-01  3.63636364e-01  4.90909091e-01
   1.80166359e-01  5.83539221e-01]
 [ 5.52470038e-01  0.00000000e+00  1.00000000e+00  4.90909091e-01
   5.87876179e-02  3.60145861e-01]
 [ 0.00000000e+00  8.94052974e-01  3.54545455e-01  4.90909091e-01
   6.90988655e-01  3.06656318e-02]
 [ 2.73019585e-01  7.99100450e-01  4.40909091e-01  1.00000000e+00
   4.43952997e-01  4.68498782e-01]
 [ 0.00000000e+00  4.78260870e-01  2.40909091e-01  2.36363636e-01
   3.04052863e-01  2.03864265e-01]
 [ 4.94007600e-01  7.14642679e-01  3.63636364e-01  4.90909091e-01
   6.58821355e-01  6.37769105e-01]
 [ 0.00000000e+00  5.02248876e-01  3.40909091e-01  4.90909091e-01
   3.70808058e-01  2.03535579e-01]
 [ 6.25548085e-01  8.19590205e-01  4.54545455e-01  4.90909091e-01
   6.03433585e-01  3.11988678e-01]
 [ 4.35252850e-01  5.79710145e-01  6.81818182e-01  4.90909091e-01
   5.14265795e-01  4.41748852e-01]
 [ 0.00000000e+00  0.00000000e+00  4.31818182e-01  4.90909091e-01
   3.81018626e-02  7.17738205e-01]
 [ 6.89856767e-01  0.00000000e+00  0.00000000e+00  3.63636364e-02
   5.88465819e-01  3.02703325e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.09090909e-01
   1.56912104e-01  4.84445694e-01]
 [ 0.00000000e+00  6.07696152e-01  2.59090909e-01  2.36363636e-01
   3.55771008e-01  4.77316919e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  3.63636364e-02
   2.11785325e-01  2.41657291e-01]
 [ 5.55393160e-02  7.04647676e-01  4.95454545e-01  3.63636364e-02
   9.29691303e-02  2.79353982e-01]
 [ 5.94855305e-01  0.00000000e+00  0.00000000e+00  1.09090909e-01
   2.41243658e-01  6.66262796e-01]
 [ 3.87605963e-01  5.16241879e-01  3.36363636e-01  4.90909091e-01
   7.18778570e-01  3.22854293e-01]
 [ 8.41859106e-01  0.00000000e+00  0.00000000e+00  1.09090909e-01
   4.49339207e-01  4.13153223e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  4.90909091e-01
   2.17416756e-01  5.43247793e-01]
 [ 3.10727857e-01  0.00000000e+00  7.50000000e-01  1.09090909e-01
   5.06607930e-02  4.37860214e-02]
 [ 3.58374744e-01  0.00000000e+00  0.00000000e+00  4.90909091e-01
   5.15853681e-01  3.04464875e-01]
 [ 0.00000000e+00  9.99500250e-01  5.90909091e-01  4.90909091e-01
   5.14030658e-01  9.07942598e-02]
 [ 5.17392575e-01  0.00000000e+00  5.04545455e-01  1.00000000e+00
   8.96745039e-02  3.60256789e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  4.90909091e-01
   7.48898678e-02  4.30749826e-01]
 [ 6.43086817e-02  6.59670165e-01  3.86363636e-01  3.63636364e-02
   8.72537611e-02  2.79321227e-01]
 [ 5.14469453e-02  7.89605197e-01  6.95454545e-01  4.90909091e-01
   1.78405304e-01  6.63741515e-01]
 [ 0.00000000e+00  6.26686657e-01  3.54545455e-01  4.90909091e-01
   3.55743235e-01  4.75793031e-01]
 [ 3.39082140e-01  0.00000000e+00  0.00000000e+00  4.90909091e-01
   5.17880030e-01  2.01749810e-01]
 [ 4.76468869e-01  6.39680160e-01  3.63636364e-01  4.90909091e-01
   5.74916729e-01  6.32665976e-01]
 [ 5.84624379e-01  0.00000000e+00  0.00000000e+00  1.09090909e-01
   7.26187716e-01  2.01813302e-01]
 [ 3.79421222e-01  5.92703648e-01  1.63636364e-01  2.36363636e-01
   4.56000759e-01  4.68250206e-01]
 [ 0.00000000e+00  8.73063468e-01  4.63636364e-01  2.36363636e-01
   2.77329645e-01  4.77085486e-01]
 [ 2.88804443e-01  3.84807596e-01  2.95454545e-01  4.90909091e-01
   2.86290696e-01  3.34503933e-01]
 [ 0.00000000e+00  5.02248876e-01  3.40909091e-01  2.36363636e-01
   3.56063434e-01  2.08196839e-01]
 [ 0.00000000e+00  5.34732634e-01  2.77272727e-01  4.90909091e-01
   2.15491554e-01  1.72474936e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.09090909e-01
   2.40555333e-01  4.65750068e-01]
 [ 0.00000000e+00  4.72763618e-01  2.09090909e-01  2.36363636e-01
   3.53511354e-01  2.08527090e-01]
 [ 3.24466530e-01  4.29785107e-01  2.27272727e-01  4.90909091e-01
   2.89745659e-01  3.28525761e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  4.90909091e-01
   1.26761638e-01  3.77769592e-01]
 [ 0.00000000e+00  8.70564718e-01  5.31818182e-01  2.36363636e-01
   2.77168512e-01  4.77115390e-01]
 [ 2.76819643e-01  0.00000000e+00  5.18181818e-01  1.00000000e+00
   5.42010293e-02  2.08413067e-01]
 [ 3.90821397e-01  0.00000000e+00  2.50000000e-01  4.90909091e-01
   2.46134577e-01  3.22788127e-01]
 [ 5.11546331e-01  0.00000000e+00  9.09090909e-02  4.90909091e-01
   5.46324959e-01  3.14007948e-01]
 [ 7.01549255e-02  3.94802599e-01  5.27272727e-01  4.90909091e-01
   5.19415526e-02  4.78443691e-01]
 [ 5.20315697e-01  6.94652674e-01  8.18181818e-01  4.90909091e-01
   5.15032074e-01  4.78582870e-01]
 [ 0.00000000e+00  7.14642679e-01  4.09090909e-01  4.90909091e-01
   6.02340012e-01  6.35871367e-01]
 [ 2.76819643e-01  0.00000000e+00  5.18181818e-01  4.90909091e-01
   5.42010293e-02  2.08413067e-01]
 [ 3.49897691e-01  0.00000000e+00  3.27272727e-01  4.90909091e-01
   1.95091622e-01  3.29453807e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  4.90909091e-01
   1.17015419e-01  4.97785561e-01]
 [ 0.00000000e+00  6.25187406e-01  5.45454545e-01  4.90909091e-01
   1.87619146e-01  4.77161733e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  4.90909091e-01
  -1.91289573e-04  8.69176566e-02]
 [ 5.34931307e-01  0.00000000e+00  4.09090909e-01  4.90909091e-01
   6.02340012e-01  9.82527015e-02]
 [ 2.92312189e-01  3.89805097e-01  4.09090909e-01  4.90909091e-01
   3.05872027e-01  2.34880622e-01]
 [ 6.12101725e-01  0.00000000e+00  0.00000000e+00  1.09090909e-01
   6.83593569e-01  4.12209297e-01]
 [ 2.85881321e-01  1.22438781e-01  5.09090909e-01  3.63636364e-02
   1.87053565e-01  4.76977755e-01]
 [ 2.86758258e-01  1.22438781e-01  3.13636364e-01  4.90909091e-01
   3.55887492e-01  4.77207788e-01]
 [ 7.01549255e-02  3.94802599e-01  5.27272727e-01  1.00000000e+00
   5.19415526e-02  4.78443691e-01]
 [ 3.06927799e-01  9.64517741e-01  2.72727273e-01  4.90909091e-01
   7.31331919e-01  6.34368787e-01]
 [ 5.84624379e-02  4.69765117e-01  6.31818182e-01  3.63636364e-02
   7.84376221e-02  2.07234574e-01]
 [ 3.87021339e-01  0.00000000e+00  0.00000000e+00  4.90909091e-01
   4.28599314e-01  2.89401729e-01]
 [ 2.95235311e-01  3.89805097e-01  4.54545455e-01  4.90909091e-01
   2.35711058e-01  3.34786548e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  4.90909091e-01
   7.48898678e-02  4.12907188e-01]
 [ 5.20315697e-01  9.34532734e-01  3.18181818e-01  4.90909091e-01
   7.09373470e-01  5.14227925e-01]
 [ 0.00000000e+00  6.20189905e-01  5.13636364e-01  3.63636364e-02
   2.77139267e-01  4.77219275e-01]
 [ 3.41420637e-01  4.54772614e-01  3.18181818e-01  4.90909091e-01
   2.54679623e-01  5.48276537e-01]
 [ 2.98158433e-01  0.00000000e+00  0.00000000e+00  4.90909091e-01
   6.08418992e-01  2.30480396e-02]
 [ 0.00000000e+00  6.25687156e-01  4.50000000e-01  2.36363636e-01
   3.71615881e-01  4.70202059e-01]
 [ 1.57263958e-01  6.09195402e-01  4.36363636e-01  4.90909091e-01
   2.77128521e-01  4.14206065e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   1.26761638e-01  3.77769592e-01]
 [ 0.00000000e+00  7.17641179e-01  0.00000000e+00  3.63636364e-02
   4.44800427e-01  2.13966452e-01]
 [ 0.00000000e+00  5.79710145e-01  4.54545455e-01  4.90909091e-01
   2.05133116e-01  9.36409168e-02]
 [ 0.00000000e+00  6.26686657e-01  3.54545455e-01  3.63636364e-02
   3.55743235e-01  4.75793031e-01]
 [ 1.32125110e-01  6.09695152e-01  3.72727273e-01  2.36363636e-01
   4.08876733e-01  4.77176944e-01]
 [ 7.60011692e-02  6.14692654e-01  1.77272727e-01  3.63636364e-02
   8.07366668e-02  2.79283692e-01]
 [ 8.41859106e-01  0.00000000e+00  0.00000000e+00  4.90909091e-01
   4.49339207e-01  4.13153223e-01]
 [ 0.00000000e+00  6.25687156e-01  4.50000000e-01  1.00000000e+00
   3.71615881e-01  4.70202059e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.09090909e-01
   4.99265786e-02  1.00000000e+00]
 [ 0.00000000e+00  6.07696152e-01  3.04545455e-01  4.90909091e-01
   2.93722499e-01  4.77156403e-01]]
In [72]:
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Assuming X and y are already defined
# X is your feature set (as a pandas DataFrame)
# y is your target variable

# Ranked feature importance order from your prior RandomForestRegressor
ranked_features = [
    'Age',
    'Water_Cement_Ratio',
    'Superplasticizer',
    'Blast Furnace Slag',
    'Coarse_Fine_Ratio',
    'Fly Ash'
]

# Split data into training and testing sets once, before the loop
X_train_full, X_test_full, y_train_full, y_test_full = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Perform the ablation study
num_features_list = []
r2_scores = []
features_used_list = []

# Ensure X_train_scaled and X_test_scaled remain unchanged globally
X_train_scaled_copy = X_train_scaled.copy()
X_test_scaled_copy = X_test_scaled.copy()

# Iterate over ranked features and perform ablation study
for i in range(len(ranked_features), 0, -1):
    selected_features = ranked_features[:i]
    
    # Use local variables to avoid global changes to scaled data
    local_X_train = X_train_full[selected_features].copy()
    local_X_test = X_test_full[selected_features].copy()
    
    # Apply scaling locally and keep as DataFrames
    scaler = MinMaxScaler()
    local_X_train_scaled = pd.DataFrame(
        scaler.fit_transform(local_X_train),
        columns=local_X_train.columns,
        index=local_X_train.index
    )
    local_X_test_scaled = pd.DataFrame(
        scaler.transform(local_X_test),
        columns=local_X_test.columns,
        index=local_X_test.index
    )
    
    # Initialize Random Forest model
    rf_model = RandomForestRegressor(
        n_estimators=386,
        max_depth=30,
        min_samples_split=2,
        min_samples_leaf=1,
        max_features='sqrt',
        random_state=42
    )
    
    # Train the model
    rf_model.fit(local_X_train_scaled, y_train_full)
    
    # Make predictions and calculate R-squared
    y_pred = rf_model.predict(local_X_test_scaled)
    r2 = r2_score(y_test_full, y_pred)
    
    # Store results
    num_features_list.append(i)
    r2_scores.append(r2)
    features_used_list.append(selected_features)

# Restore the global X_train_scaled and X_test_scaled values
X_train_scaled = X_train_scaled_copy
X_test_scaled = X_test_scaled_copy

# Plotting the R-squared values vs number of features
plt.figure(figsize=(10, 6))
plt.plot(
    num_features_list,
    r2_scores,
    marker='s',
    markersize=8,
    linewidth=2,
    color='darkblue'
)
plt.xlabel('Number of Features', fontsize=14)
plt.ylabel('R-Squared Value', fontsize=14)
plt.title(
    'Ablation Study of R-Squared Values with Varying Feature Counts',
    fontsize=16,
    fontweight='bold'
)
plt.gca().invert_xaxis()
plt.xticks(num_features_list, [f'{i} features' for i in num_features_list])
plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.show()

# Output the R-squared values and features kept to verify
for num_features, r2, features in zip(num_features_list, r2_scores, features_used_list):
    print(f"Number of features: {num_features}, R-Squared: {r2:.4f}, Features used: {features}")
[Figure: "Ablation Study of R-Squared Values with Varying Feature Counts" — line plot of R-squared versus number of features, x-axis inverted from 6 features down to 1.]
Number of features: 6, R-Squared: 0.8810, Features used: ['Age', 'Water_Cement_Ratio', 'Superplasticizer', 'Blast Furnace Slag', 'Coarse_Fine_Ratio', 'Fly Ash']
Number of features: 5, R-Squared: 0.8841, Features used: ['Age', 'Water_Cement_Ratio', 'Superplasticizer', 'Blast Furnace Slag', 'Coarse_Fine_Ratio']
Number of features: 4, R-Squared: 0.8798, Features used: ['Age', 'Water_Cement_Ratio', 'Superplasticizer', 'Blast Furnace Slag']
Number of features: 3, R-Squared: 0.7634, Features used: ['Age', 'Water_Cement_Ratio', 'Superplasticizer']
Number of features: 2, R-Squared: 0.6561, Features used: ['Age', 'Water_Cement_Ratio']
Number of features: 1, R-Squared: 0.2813, Features used: ['Age']
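With the ablation results collected, the best-performing subset can be read off programmatically rather than by eye. A minimal sketch, using the R-squared values printed above (variable names mirror those in the ablation cell):

```python
# R-squared values and feature subsets copied from the printed ablation output above
r2_scores = [0.8810, 0.8841, 0.8798, 0.7634, 0.6561, 0.2813]
features_used_list = [
    ['Age', 'Water_Cement_Ratio', 'Superplasticizer', 'Blast Furnace Slag',
     'Coarse_Fine_Ratio', 'Fly Ash'],
    ['Age', 'Water_Cement_Ratio', 'Superplasticizer', 'Blast Furnace Slag',
     'Coarse_Fine_Ratio'],
    ['Age', 'Water_Cement_Ratio', 'Superplasticizer', 'Blast Furnace Slag'],
    ['Age', 'Water_Cement_Ratio', 'Superplasticizer'],
    ['Age', 'Water_Cement_Ratio'],
    ['Age'],
]

# Index of the subset with the highest R-squared
best = max(range(len(r2_scores)), key=r2_scores.__getitem__)
print(f"Best R^2 = {r2_scores[best]:.4f} with {len(features_used_list[best])} features")
print(features_used_list[best])
```

Here the five-feature subset (dropping Fly Ash) edges out the full set, consistent with Fly Ash ranking last in importance.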
   2.77128521e-01  4.14206065e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  0.00000000e+00
   1.26761638e-01  3.77769592e-01]
 [ 0.00000000e+00  7.17641179e-01  0.00000000e+00  3.63636364e-02
   4.44800427e-01  2.13966452e-01]
 [ 0.00000000e+00  5.79710145e-01  4.54545455e-01  4.90909091e-01
   2.05133116e-01  9.36409168e-02]
 [ 0.00000000e+00  6.26686657e-01  3.54545455e-01  3.63636364e-02
   3.55743235e-01  4.75793031e-01]
 [ 1.32125110e-01  6.09695152e-01  3.72727273e-01  2.36363636e-01
   4.08876733e-01  4.77176944e-01]
 [ 7.60011692e-02  6.14692654e-01  1.77272727e-01  3.63636364e-02
   8.07366668e-02  2.79283692e-01]
 [ 8.41859106e-01  0.00000000e+00  0.00000000e+00  4.90909091e-01
   4.49339207e-01  4.13153223e-01]
 [ 0.00000000e+00  6.25687156e-01  4.50000000e-01  1.00000000e+00
   3.71615881e-01  4.70202059e-01]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00  1.09090909e-01
   4.99265786e-02  1.00000000e+00]
 [ 0.00000000e+00  6.07696152e-01  3.04545455e-01  4.90909091e-01
   2.93722499e-01  4.77156403e-01]]
In [74]:
import shap
import matplotlib.pyplot as plt

# Assuming best_rf_model is already defined and trained
# Initialize the SHAP explainer using the trained Random Forest model
explainer = shap.TreeExplainer(best_rf_model)

# Calculate SHAP values for the dataset used with the model
# For regression models, shap_values will be a 2D array of shape (n_samples, n_features)
shap_values = explainer.shap_values(X)

# Alternatively, for newer versions of SHAP, you can use:
# shap_values = explainer(X)

# Create the SHAP summary plot without displaying it immediately
# Capture the matplotlib figure and axes objects
plt.figure(figsize=(10, 6))
shap.summary_plot(shap_values, X, show=False, plot_size=(10, 6))

# Get the current figure and axes
fig = plt.gcf()
ax = plt.gca()

# Customize the SHAP plot for a professional journal look
ax.set_title('SHAP Summary Plot of Feature Importance', fontsize=16, fontweight='bold', fontname='Arial')
ax.set_xlabel('SHAP Value (Impact on Output)', fontsize=14, fontweight='bold', fontname='Arial')
ax.set_ylabel('Features', fontsize=14, fontweight='bold', fontname='Arial')

# Customize ticks
ax.tick_params(axis='both', which='major', labelsize=12, labelcolor='black')
for label in ax.get_xticklabels():
    label.set_fontname('Arial')
    label.set_fontweight('bold')
for label in ax.get_yticklabels():
    label.set_fontname('Arial')
    label.set_fontweight('bold')

# Make grid lines more subtle
ax.grid(True, linestyle='--', linewidth=0.7)

# Save the figure with a high DPI for journal publication
plt.savefig('shap_summary_plot.png', dpi=1200, bbox_inches='tight')

# Display the customized plot
plt.show()
(Figure: SHAP summary plot of feature importance)

SHAP Summary Plot Interpretation¶

The SHAP summary plot provides insights into the feature importance and their impact on the model's predictions for concrete compressive strength. Each point on the plot represents a SHAP value for a feature and a specific prediction, indicating how much each feature contributed to that prediction.

Key Observations:¶

  • Feature Importance Ranking:

    • Age and Water_Cement_Ratio are the most significant features impacting the model's predictions, showing the widest spread of SHAP values. Higher Age values push the predicted compressive strength up, whereas higher Water_Cement_Ratio values push it down.
  • Positive and Negative Impacts:

    • Water_Cement_Ratio:
      • A lower water-to-cement ratio is associated with higher compressive strength, as indicated by the negative SHAP values for higher ratios.
    • Age:
      • Increased age positively affects strength, as indicated by the overall positive SHAP values. Older concrete tends to have better strength characteristics.
  • Moderate Impact Features:

    • Superplasticizer and Blast Furnace Slag also play important roles but have more moderate effects compared to Age and Water_Cement_Ratio. The SHAP values for these features are clustered around zero, indicating they have a less pronounced effect on predictions.
  • Less Significant Features:

    • Coarse_Fine_Ratio and Fly Ash show minimal impact on predictions, as their SHAP values are close to zero. This suggests that these features do not significantly contribute to the model's predictive power in this context.

Conclusion:¶

The SHAP summary plot effectively illustrates the importance and influence of different features on the Random Forest model's predictions. Features such as Age and Water_Cement_Ratio are crucial in determining the concrete's compressive strength, while others, like Coarse_Fine_Ratio and Fly Ash, have less influence. This analysis can guide decision-making in optimizing concrete mixtures for desired strength characteristics.
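The feature ranking suggested by SHAP can be cross-checked with a model-agnostic method such as permutation importance. The sketch below is a minimal, self-contained illustration on synthetic data (not the concrete dataset): a feature that strongly drives the target should dominate the permutation-importance ranking just as it dominates mean absolute SHAP values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: feature 0 dominates the target, feature 2 is pure noise
rng = np.random.default_rng(42)
X_demo = rng.uniform(size=(300, 3))
y_demo = 10 * X_demo[:, 0] + X_demo[:, 1] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Permutation importance: drop in score when each column is shuffled
result = permutation_importance(model, X_demo, y_demo, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)  # feature 0 should rank first
```

Applied to the notebook's own model and data (best_rf_model, X, y), the same call would provide an independent confirmation of the SHAP-based ordering.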

SHAP Force Plot¶

In [75]:
%matplotlib inline
In [76]:
# Select the instance to analyze
instance_index = 0

# Print feature values for the selected instance
print("Feature values for the selected instance:")
print(X.iloc[instance_index])

# Print SHAP values for the selected instance
print("\nSHAP values for the selected instance:")
print(shap_values[instance_index])
Feature values for the selected instance:
Blast Furnace Slag     0.000000
Fly Ash                0.000000
Superplasticizer       2.500000
Age                   28.000000
Water_Cement_Ratio     0.300000
Coarse_Fine_Ratio      1.560651
Name: 1, dtype: float64

SHAP values for the selected instance:
[-0.94133444 -0.1868148  -0.84626405  6.10681821 27.33260161  0.31112131]
In [77]:
# Assuming y is your target variable (e.g., a pandas Series or NumPy array)

# For pandas Series
real_y_value = y.iloc[instance_index]

# For NumPy array or list
# real_y_value = y[instance_index]

print(f"The actual target value for instance {instance_index} is: {real_y_value}")
The actual target value for instance 0 is: 61.89
In [78]:
import shap
import matplotlib.pyplot as plt

# Initialize the SHAP explainer using the trained Random Forest model
explainer = shap.TreeExplainer(best_rf_model)

# Calculate SHAP values for the dataset used with the model
shap_values = explainer.shap_values(X)

# Select an instance to visualize (you can change the index)
instance_index = 0  # For the first instance in the dataset

# Create a Force Plot for the selected instance
shap.initjs()  # Initialize JavaScript (needed for the interactive HTML version)

# With matplotlib=True, force_plot creates its own figure, so an explicit
# plt.figure() call beforehand would only leave behind an empty figure
shap.force_plot(explainer.expected_value, shap_values[instance_index], X.iloc[instance_index], matplotlib=True, show=False)

# Save the Force Plot to a PNG file
plt.savefig('force_plot.png', bbox_inches='tight')  # Save as a PNG file

# Save the Force Plot to an HTML file
shap.save_html('force_plot.html', shap.force_plot(explainer.expected_value, shap_values[instance_index], X.iloc[instance_index]))  # Save as an HTML file

# Show the plot (optional if you want to see it interactively)
plt.show()
(Figure: SHAP force plot for the selected instance)

Waterfall Plot¶

In [79]:
import shap
import matplotlib.pyplot as plt

# Initialize the SHAP explainer using the trained Random Forest model
explainer = shap.Explainer(best_rf_model, X)

# Calculate SHAP values for the dataset used with the model
shap_values = explainer(X)

# Select an instance to visualize (you can change the index)
instance_index = 0  # For the first instance in the dataset

# Extract the SHAP values for the selected instance
shap_value = shap_values[instance_index]

# Create a Waterfall Plot for the selected instance
shap.plots.waterfall(shap_value, max_display=10, show=False)

# Save the Waterfall Plot to a PNG file
plt.savefig('waterfall_plot.png', bbox_inches='tight')  # Save as a PNG file

# Show the plot
plt.show()
(Figure: SHAP waterfall plot for the selected instance)

Insights from SHAP Analysis¶

The selected instance for analysis has the following feature values:

  • Blast Furnace Slag: 0.00
  • Fly Ash: 0.00
  • Superplasticizer: 2.50
  • Age: 28.00
  • Water_Cement_Ratio: 0.30
  • Coarse_Fine_Ratio: 1.56

The corresponding SHAP values for these features (taken from the waterfall plot above, which uses shap.Explainer with a background dataset and therefore differs slightly from the TreeExplainer values printed earlier) are:

  • Blast Furnace Slag: -0.0819
  • Fly Ash: 0.7360
  • Superplasticizer: -0.4853
  • Age: 3.6882
  • Water_Cement_Ratio: 21.8129
  • Coarse_Fine_Ratio: 1.6849

Interpretation¶

  • The SHAP values indicate how each feature impacts the model's output for this specific prediction. Positive values push the prediction higher, while negative values push it lower.
  • Water_Cement_Ratio has the highest positive SHAP value, significantly increasing the predicted compressive strength.
  • Conversely, Blast Furnace Slag and Superplasticizer have negative SHAP values, which reduce the predicted strength.
  • For this instance, the model predicts a strength of 61.85 MPa, close to the measured value of 61.89 MPa, emphasizing the influence of the Water_Cement_Ratio and Age features.
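The waterfall and force plots both rely on SHAP's additivity property: the prediction equals the model's base value plus the per-feature contributions. The sketch below checks that identity using the TreeExplainer values printed earlier for this instance; the base value of 30.0 is a hypothetical stand-in (in the notebook, the real one is explainer.expected_value, the model's average prediction).

```python
import numpy as np

# TreeExplainer SHAP values printed earlier for this instance, in the order:
# Blast Furnace Slag, Fly Ash, Superplasticizer, Age,
# Water_Cement_Ratio, Coarse_Fine_Ratio
shap_row = np.array([-0.94133444, -0.1868148, -0.84626405,
                     6.10681821, 27.33260161, 0.31112131])

# Hypothetical base value for illustration only; the real one is
# explainer.expected_value
base_value = 30.0

# SHAP additivity: prediction = base value + sum of per-feature contributions
prediction = base_value + shap_row.sum()
print(round(prediction, 2))
```

Running the same check with the actual expected_value reproduces the model's prediction for the instance exactly.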

Dependence Plot¶

In [80]:
import shap
import matplotlib.pyplot as plt

# Assuming best_rf_model is already defined and trained
# Initialize the SHAP explainer using the trained Random Forest model
explainer = shap.TreeExplainer(best_rf_model)

# Calculate SHAP values for the dataset used with the model
shap_values = explainer.shap_values(X)

# Select a feature for the dependence plot
feature_name = 'Water_Cement_Ratio'  # Change this to the feature you want to analyze

# Create the dependence plot (dependence_plot draws into its own figure,
# so a separate plt.figure() call would only leave an empty figure behind)
shap.dependence_plot(feature_name, shap_values, X, show=False)  # Prevent immediate display

# Customize the title and show the plot
plt.title(f'Dependence Plot for {feature_name}', fontsize=16, fontweight='bold')
plt.show()  # Show the plot only once here
(Figure: SHAP dependence plot for Water_Cement_Ratio)
In [81]:
import shap
import matplotlib.pyplot as plt

# Assuming best_rf_model is already defined and trained
# Initialize the SHAP explainer using the trained Random Forest model
explainer = shap.TreeExplainer(best_rf_model)

# Calculate SHAP values for the dataset used with the model
shap_values = explainer.shap_values(X)

# If shap_values is a list (classification problem), select one class
if isinstance(shap_values, list):
    shap_values = shap_values[0]  # Use the SHAP values for the first class

# Features to be plotted
features = X.columns

# Create a figure for the subplots
num_features = len(features)
fig, axs = plt.subplots(num_features, num_features, figsize=(20, 20))

# Iterate through each pair of features
for i in range(num_features):
    for j in range(num_features):
        if i != j:
            # Create the dependence plot for feature i, colored by feature j
            shap.dependence_plot(
                features[i],
                shap_values,
                X,
                interaction_index=features[j],
                ax=axs[i, j],
                show=False  # Prevent immediate display
            )
        else:
            axs[i, j].axis('off')  # Hide the diagonal plots

# Adjust layout and show the plot
plt.suptitle('SHAP Dependence Plots for Feature Combinations', fontsize=20)
plt.tight_layout()
plt.subplots_adjust(top=0.95)  # Adjust the top to fit the title
plt.show()
(Figure: grid of pairwise SHAP dependence plots)
In [82]:
import shap
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap  # Import ListedColormap
import numpy as np

# Assuming best_rf_model is already defined and trained
# Initialize the SHAP explainer using the trained Random Forest model
explainer = shap.TreeExplainer(best_rf_model)

# Calculate SHAP values for the dataset used with the model
shap_values = explainer.shap_values(X)

# If shap_values is a list (classification problem), select one class
if isinstance(shap_values, list):
    shap_values = shap_values[0]  # Use the SHAP values for the first class

# Features to be plotted
features = X.columns

# Create a figure for the subplots
num_features = len(features)
fig, axs = plt.subplots(num_features, num_features, figsize=(20, 20))

# Define a custom colormap with colors ordered from low to high values
colors_list = ['green', 'blue', 'orange', 'red']  # Reordered colors
cmap = ListedColormap(colors_list)

# Iterate through each pair of features
for i in range(num_features):
    for j in range(num_features):
        if i != j:
            # Create the dependence plot for feature i, colored by feature j
            shap.dependence_plot(
                features[i],
                shap_values,
                X,
                interaction_index=features[j],
                ax=axs[i, j],
                show=False,  # Prevent immediate display
                cmap=cmap    # Use the custom colormap
            )
        else:
            axs[i, j].axis('off')  # Hide the diagonal plots

# Adjust layout and show the plot
plt.suptitle('SHAP Dependence Plots for Feature Combinations', fontsize=20)
plt.tight_layout()
plt.subplots_adjust(top=0.95)  # Adjust the top to fit the title
plt.show()
(Figure: grid of pairwise SHAP dependence plots with custom colormap)

Bar Plot¶

In [83]:
import shap
import matplotlib.pyplot as plt
import numpy as np

# Assuming best_rf_model is already defined and trained
# Initialize the SHAP explainer using the trained Random Forest model
explainer = shap.TreeExplainer(best_rf_model)

# Calculate SHAP values for the dataset used with the model
shap_values = explainer.shap_values(X)

# Compute the mean absolute SHAP values for each feature (for reference;
# shap.summary_plot with plot_type='bar' computes this quantity internally)
mean_shap_values = np.abs(shap_values).mean(axis=0)

# Create a bar plot for the mean absolute SHAP values
plt.figure(figsize=(10, 6))
shap.summary_plot(shap_values, X, plot_type="bar", show=False)  # Ensure 'show=False' to prevent auto-display

# Customize the plot, add labels, title, etc., if needed here

# Show the plot without creating an empty figure
plt.show()
(Figure: bar plot of mean absolute SHAP values per feature)

Insights from SHAP Value Bar Plot¶

  • Age: This feature has the highest mean absolute SHAP value, indicating that it is the most significant predictor of the model's output. Age largely determines the concrete's strength as it represents the curing time, which is crucial for strength development.

  • Water_Cement_Ratio: Following age, the water to cement ratio is also highly impactful. This ratio is critical in concrete mix design, influencing workability, hydration, and ultimately, the compressive strength of concrete.

  • Superplasticizer: This component appears to have a considerable effect, enhancing the workability of the concrete mix while reducing the water requirement. Its significant role is likely due to its effectiveness in improving the concrete’s mechanical properties.

  • Blast Furnace Slag: Acts as a partial replacement for cement and impacts the strength development, particularly in the long term. Its influence in the model reflects its real-world effect on durability and strength enhancement.

  • Coarse_Fine_Ratio and Fly Ash: While these features have lower SHAP values, they still contribute to the predictive model, affecting the strength characteristics by altering the concrete's density and long-term performance.

Overall, the SHAP values align well with known influences in concrete science, highlighting the importance of mixture components and curing time on the final product's strength.
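The quantity behind the bar plot is simply the mean absolute SHAP value per feature. The sketch below computes it on a small hypothetical SHAP matrix (the first row echoes the instance values printed earlier; the other rows are made up for illustration) — in the notebook, the real input is the shap_values array computed above.

```python
import numpy as np
import pandas as pd

feature_names = ['Blast Furnace Slag', 'Fly Ash', 'Superplasticizer',
                 'Age', 'Water_Cement_Ratio', 'Coarse_Fine_Ratio']

# Hypothetical SHAP matrix (rows = samples, columns = features)
shap_demo = np.array([
    [-0.9, -0.2, -0.8,  6.1, 27.3,  0.3],
    [ 1.2,  0.1, -0.5, -4.0, -9.8,  0.6],
    [-0.3,  0.4,  1.1,  8.2,  5.1, -0.4],
])

# The bar plot shows exactly this quantity: mean absolute SHAP value
mean_abs = pd.Series(np.abs(shap_demo).mean(axis=0), index=feature_names)
print(mean_abs.sort_values(ascending=False))
```

Taking the absolute value before averaging matters: positive and negative contributions would otherwise cancel and understate a feature's influence.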

Partial Dependence Plot (PDP) for Water and Superplasticizer¶

In [84]:
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# Assuming best_rf_model and X are already defined and the model is trained
# Features to inspect
features = ['Water_Cement_Ratio', 'Age']  # Single feature or a pair of features

fig, ax = plt.subplots(figsize=(12, 8))
display = PartialDependenceDisplay.from_estimator(
    best_rf_model,
    X,
    features=features,
    ax=ax,
    grid_resolution=50
)

# Enhancing the plot aesthetics
ax.set_title('Partial Dependence Plot', fontsize=16)
plt.subplots_adjust(top=0.9)  # Adjust layout to not overlap with the title

plt.show()
(Figure: partial dependence plots for Water_Cement_Ratio and Age)

Insights from Partial Dependence Plots¶

Water_Cement_Ratio¶

  • The plot for Water_Cement_Ratio shows a clear decreasing trend in the model's predicted strength as the water to cement ratio increases.
  • The decline is steepest at low ratios, so small increases in the water to cement ratio cost the most strength early on, and the curve gradually levels off once the ratio passes about 1.0. This supports the common understanding in concrete technology that higher water to cement ratios generally lead to lower strength, as excess water dilutes the cement paste and weakens the hardened matrix.

Age¶

  • The Age plot reveals a pronounced positive relationship between the age of the concrete and its predicted strength, which sharply increases until about 40 days before plateauing.
  • This increase aligns with the hydration process of cement where concrete continues to cure and gain strength over time, particularly during the early stages. The plateau observed after around 40 days might indicate that most of the strength gain has occurred by this point, which is a crucial insight for construction timelines and durability assessments.

These plots provide valuable confirmations of well-known behaviors in concrete mix design and curing processes, and they underscore the utility of the model in capturing these important trends.

In [85]:
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# Features to inspect (as a tuple for interaction)
features = [('Water_Cement_Ratio', 'Age')]

# Adjust percentiles to include the entire range
display = PartialDependenceDisplay.from_estimator(
    best_rf_model,
    X,
    features=features,
    grid_resolution=50,
    percentiles=(0, 1),  # Include the whole range of both features
    kind='average',      # 'average' for regression models
)

# Enhancing the plot aesthetics
plt.suptitle('Partial Dependence Plot (Interaction)', fontsize=16)
plt.subplots_adjust(top=0.9)

plt.show()
(Figure: two-way partial dependence plot for Water_Cement_Ratio and Age)

Insights from Partial Dependence Plot (Interaction)¶

Interaction between Age and Water_Cement_Ratio¶

  • The interaction partial dependence plot reveals complex relationships between the age of the concrete and its water to cement ratio concerning predicted strength.
  • Notably, there is a pronounced gradient change in predicted strength across different ages and water to cement ratios:
    • Young Concrete (Age < 40 days): The plot shows a steep decline in strength as the water to cement ratio increases from 0.4 to around 1.0. This suggests that for younger concrete, maintaining a lower water to cement ratio is crucial for achieving higher strength.
    • Mature Concrete (Age > 40 days): The impact of the water to cement ratio on strength diminishes with age. Concrete aged over 40 days shows relatively high strength across a broader range of water to cement ratios, indicating the reduced sensitivity to initial mix proportions as the concrete cures over time.
  • The contours plotted show specific strength levels, with the highest strengths (around 56.27) observed at low water to cement ratios and higher ages, underscoring the dual benefit of mature age and optimal mix design in achieving peak concrete strength.

This plot underscores the importance of considering both age and mix proportions in predicting and optimizing concrete strength, highlighting how interactions between these factors can significantly influence outcomes in real-world applications.

Gradient Boosting Regressor for Regression Task¶

In [86]:
from sklearn.ensemble import GradientBoostingRegressor

Training and Prediction with Gradient Boosting Regressor¶

In [87]:
from sklearn.ensemble import GradientBoostingRegressor

# Initialize Gradient Boosting Regressor
gb_reg = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)  # Adjust hyperparameters as needed

# Train the model on training data
gb_reg.fit(X_train, y_train)

# Make predictions on the scaled test set
y_pred_gb_reg = gb_reg.predict(X_test)

Evaluating Gradient Boosting Regressor: Mean Squared Error Calculation¶

In [88]:
from sklearn.metrics import mean_squared_error, r2_score

# Evaluate Mean Squared Error (MSE)
mse_gb_reg = mean_squared_error(y_test, y_pred_gb_reg)
print(f"Mean Squared Error: {mse_gb_reg}")

# Evaluate R-squared value
r2_gb_reg = r2_score(y_test, y_pred_gb_reg)
print(f"R-squared: {r2_gb_reg}")
Mean Squared Error: 23.578556802805835
R-squared: 0.90000971439312
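Since MSE is expressed in squared MPa, its square root (RMSE) is often easier to interpret because it is in the same units as the target. A quick conversion using the MSE reported above:

```python
import numpy as np

# MSE from the cell above, in squared MPa
mse_gb_reg = 23.578556802805835

# RMSE puts the error back in the units of the target (MPa)
rmse_gb_reg = np.sqrt(mse_gb_reg)
print(f"RMSE: {rmse_gb_reg:.2f} MPa")
```

An RMSE of roughly 4.9 MPa gives a direct sense of the typical prediction error relative to strengths in the tens of MPa.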
In [89]:
print(X_train)
      Blast Furnace Slag  Fly Ash  Superplasticizer  Age  Water_Cement_Ratio  \
1007               243.5      0.0              10.7   28            1.158740   
138                189.0      0.0               9.5   28            0.517609   
587                158.8      0.0               0.0   28            0.779597   
924                196.0     98.0               6.0   28            1.463235   
541                  0.0      0.0               0.0    3            0.576577   
...                  ...      ...               ...  ...                 ...   
189                  0.0     94.6               4.6    3            0.846450   
237                 98.1     24.5               6.7   56            0.849860   
432                  0.0    143.6               0.0   28            0.992727   
632                  0.0      0.0               0.0   28            0.566154   
232                 98.1     24.5               6.9   56            0.850257   

      Coarse_Fine_Ratio  
1007           1.464813  
138            1.249934  
587            1.417132  
924            1.081737  
541            1.105151  
...                 ...  
189            1.111241  
237            1.357097  
432            1.116217  
632            1.357599  
232            1.357016  

[620 rows x 6 columns]
In [90]:
# from skopt import gp_minimize
# from skopt.space import Real, Categorical, Integer
# from skopt.utils import use_named_args
# from sklearn.ensemble import GradientBoostingRegressor
# from sklearn.metrics import mean_squared_error, r2_score
# from sklearn.model_selection import cross_val_score
# import numpy as np

# # Define the space of hyperparameters to search
# space = [
#     Integer(100, 500, name='n_estimators'),
#     Real(0.01, 0.5, name='learning_rate'),
#     Integer(1, 10, name='max_depth'),
#     Real(0.01, 0.5, name='subsample'),
#     Real(0.1, 0.5, name='min_samples_split')
# ]

# # Objective function to minimize
# @use_named_args(space)
# def objective(**params):
#     best_gb_reg = GradientBoostingRegressor(random_state=42, **params)
#     # Use negative mean squared error as the score to minimize
#     return -np.mean(cross_val_score(best_gb_reg, X_train, y_train, cv=5, n_jobs=-1, scoring="neg_mean_squared_error"))

# # Run Bayesian optimization
# result = gp_minimize(objective, space, n_calls=50, random_state=42)

# # Print the best parameters found
# print("Best parameters:", result.x)
# print("Best MSE:", -result.fun)

# # Optionally, fit the model with the best parameters
# best_gb_reg = GradientBoostingRegressor(random_state=42, **{dim.name: val for dim, val in zip(space, result.x)})
# best_gb_reg.fit(X_train, y_train)
# y_pred = best_gb_reg.predict(X_test)
# mse = mean_squared_error(y_test, y_pred)
# r2 = r2_score(y_test, y_pred)

# print("Mean Squared Error:", mse)
# print("R-squared:", r2)
In [91]:
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import pandas as pd

# Assuming X and y are already defined
# X is your feature set (as a pandas DataFrame)
# y is your target variable

# Ranked feature importance order from your best Gradient Boosting Regressor
ranked_features = [
    'Water_Cement_Ratio',
    'Age',
    'Superplasticizer',
    'Blast Furnace Slag',
    'Coarse_Fine_Ratio',
    'Fly Ash'
]

# Split data into training and testing sets once, before the loop
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Perform the ablation study
num_features_list = []
r2_scores = []
mse_scores = []
features_used_list = []

# Keep an unfitted copy of the tuned model configuration to restore after the ablation study
best_gb_reg_copy = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.20566563644514926,
    max_depth=10,
    subsample=0.5,
    min_samples_split=0.24222084612038797,
    random_state=42
)

# Iterate over ranked features and perform ablation study
for i in range(len(ranked_features), 0, -1):
    selected_features = ranked_features[:i]

    # Use local variables to avoid global changes to data
    local_X_train = X_train[selected_features].copy()
    local_X_test = X_test[selected_features].copy()

    # Initialize Gradient Boosting Regressor model with the best parameters for ablation
    local_gb_reg = GradientBoostingRegressor(
        n_estimators=500,
        learning_rate=0.20566563644514926,
        max_depth=10,
        subsample=0.5,
        min_samples_split=0.24222084612038797,
        random_state=42
    )

    # Train the model
    local_gb_reg.fit(local_X_train, y_train)

    # Make predictions and calculate R-squared and MSE
    y_pred = local_gb_reg.predict(local_X_test)
    r2 = r2_score(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)

    # Store results
    num_features_list.append(i)
    r2_scores.append(r2)
    mse_scores.append(mse)
    features_used_list.append(selected_features)

# Plotting the R-squared values vs number of features with high DPI
plt.figure(figsize=(10, 6), dpi=1200)  # Set DPI to 1200
plt.plot(
    num_features_list,
    r2_scores,
    marker='s',
    markersize=8,
    linewidth=2,
    color='darkblue'
)
plt.xlabel('Number of Features', fontsize=18)  # Increased fontsize
plt.ylabel('R-Squared Value', fontsize=18)    # Increased fontsize
plt.title(
    'Ablation Study of R-Squared Values with Varying Feature Counts',
    fontsize=20,  # Increased fontsize
    fontweight='bold'
)
plt.gca().invert_xaxis()

# Set x and y ticks with increased fontsize
plt.xticks(num_features_list, [f'{i} features' for i in num_features_list], fontsize=14)
plt.yticks(fontsize=14)

plt.grid(True, which='both', linestyle='--', linewidth=0.5)
plt.tight_layout()  # Adjust layout to prevent clipping

# Optionally, save the figure with high DPI
# plt.savefig('ablation_study_high_res.png', dpi=1200)

plt.show()

# Output the R-squared values and features kept to verify along with MSE
for num_features, r2, mse, features in zip(num_features_list, r2_scores, mse_scores, features_used_list):
    print(f"Number of features: {num_features}, R-Squared: {r2:.4f}, MSE: {mse:.4f}, Features used: {features}")

# Restore the original best_gb_reg model by refitting it after ablation study
best_gb_reg = best_gb_reg_copy  # Restoring the original model
best_gb_reg.fit(X_train, y_train)  # Refit the original model on full training data

# Make predictions on the test set
y_pred_full = best_gb_reg.predict(X_test)

# Calculate R-squared and MSE for the full model
r_squared_full = r2_score(y_test, y_pred_full)
mse_full = mean_squared_error(y_test, y_pred_full)
print(f"R-squared with all features: {r_squared_full:.4f}, Mean Squared Error with all features: {mse_full:.4f}")
(Figure: ablation study of R-squared values vs number of features)
Number of features: 6, R-Squared: 0.9287, MSE: 16.8225, Features used: ['Water_Cement_Ratio', 'Age', 'Superplasticizer', 'Blast Furnace Slag', 'Coarse_Fine_Ratio', 'Fly Ash']
Number of features: 5, R-Squared: 0.9215, MSE: 18.5024, Features used: ['Water_Cement_Ratio', 'Age', 'Superplasticizer', 'Blast Furnace Slag', 'Coarse_Fine_Ratio']
Number of features: 4, R-Squared: 0.9319, MSE: 16.0582, Features used: ['Water_Cement_Ratio', 'Age', 'Superplasticizer', 'Blast Furnace Slag']
Number of features: 3, R-Squared: 0.8097, MSE: 44.8791, Features used: ['Water_Cement_Ratio', 'Age', 'Superplasticizer']
Number of features: 2, R-Squared: 0.7463, MSE: 59.8168, Features used: ['Water_Cement_Ratio', 'Age']
Number of features: 1, R-Squared: 0.1786, MSE: 193.6838, Features used: ['Water_Cement_Ratio']
R-squared with all features: 0.9287, Mean Squared Error with all features: 16.8225

SHAP Summary Plot¶

In [92]:
import shap
import matplotlib.pyplot as plt

# Assuming best_gb_reg is already defined and trained
# Initialize the SHAP explainer using the trained Gradient Boosting model
explainer = shap.TreeExplainer(best_gb_reg)

# Calculate SHAP values for the dataset used with the model
# For regression models, shap_values will be a 2D array of shape (n_samples, n_features)
shap_values = explainer.shap_values(X)

# Alternatively, for newer versions of SHAP, you can use:
# shap_values = explainer(X)

# Set global font properties for consistency with increased sizes
plt.rcParams.update({
    'font.size': 20,                # Default font size increased from 18 to 20
    'axes.titlesize': 26,           # Title font size increased from 24 to 26
    'axes.labelsize': 22,           # Axis label font size increased from 20 to 22
    'xtick.labelsize': 18,          # X-axis tick label font size increased from 16 to 18
    'ytick.labelsize': 18,          # Y-axis tick label font size increased from 16 to 18
    'legend.fontsize': 20,          # Legend font size increased from 18 to 20
    'font.weight': 'bold',          # Default font weight remains bold
    'axes.titleweight': 'bold',     # Title font weight remains bold
    'axes.labelweight': 'bold',     # Axis label font weight remains bold
    'axes.grid': True,              # Enable grid
    'grid.linestyle': '--',         # Grid line style
    'grid.linewidth': 0.7,          # Grid line width
    'axes.edgecolor': 'black',      # Axis edge color
    'axes.linewidth': 1.2,          # Axis line width
    'figure.titlesize': 26,         # Figure title size increased from 24 to 26
    'figure.titleweight': 'bold'    # Figure title weight remains bold
})

# Create the SHAP summary plot without displaying it immediately
# Capture the matplotlib figure and axes objects
plt.figure(figsize=(12, 8), dpi=1200)  # Increased figure size for better readability
shap.summary_plot(shap_values, X, show=False, plot_size=(12, 8))

# Get the current figure and axes
fig = plt.gcf()
ax = plt.gca()

# Customize the SHAP plot for a professional journal look with increased font sizes
ax.set_title('SHAP Summary Plot of Feature Importance', fontsize=26, fontweight='bold', fontname='Arial')  # Increased fontsize from 24 to 26
ax.set_xlabel('SHAP Value (Impact on Output)', fontsize=22, fontweight='bold', fontname='Arial')          # Increased fontsize from 20 to 22
ax.set_ylabel('Features', fontsize=22, fontweight='bold', fontname='Arial')                              # Increased fontsize from 20 to 22

# Customize ticks with increased font sizes
ax.tick_params(axis='both', which='major', labelsize=18, labelcolor='black')  # Increased labelsize from 16 to 18

# Make tick labels bold and set font
for label in ax.get_xticklabels():
    label.set_fontname('Arial')
    label.set_fontweight('bold')
for label in ax.get_yticklabels():
    label.set_fontname('Arial')
    label.set_fontweight('bold')

# Access and customize the color bar (legend)
# SHAP typically adds a color bar as the last axis in the figure
# Iterate through all axes to find the color bar
cbar = None
for axis in fig.axes:
    if axis != ax:
        cbar = axis
        break

if cbar:
    # Set the label font properties with increased size
    cbar.set_ylabel('Feature Value', fontsize=22, fontweight='bold', fontname='Arial')  # Increased fontsize from 20 to 22
    # Set tick labels to be bold and larger
    for label in cbar.get_yticklabels():
        label.set_fontname('Arial')
        label.set_fontweight('bold')
        label.set_fontsize(18)  # Increased fontsize from 16 to 18
    # Set color bar label properties
    cbar.tick_params(labelsize=18)  # Increased labelsize from 16 to 18
    # Set color bar title (if any) to bold and larger
    # Note: Axes.get_title() returns a string, so it can be passed to set_title directly
    if cbar.get_title():
        cbar.set_title(cbar.get_title(), fontweight='bold', fontsize=22, fontname='Arial')  # Increased fontsize from 20 to 22

# Optional: Adjust layout to ensure everything fits without overlapping
plt.tight_layout()

# Save the figure with a high DPI for journal publication
plt.savefig('shap_summary_plot_high_res.png', dpi=1200, bbox_inches='tight')

# Display the customized plot
plt.show()
[Figure: SHAP summary plot of feature importance]
In [93]:
# Select the instance to analyze
instance_index = 0

# Print feature values for the selected instance
print("Feature values for the selected instance:")
print(X.iloc[instance_index])

# Print SHAP values for the selected instance
print("\nSHAP values for the selected instance:")
print(shap_values[instance_index])
Feature values for the selected instance:
Blast Furnace Slag     0.000000
Fly Ash                0.000000
Superplasticizer       2.500000
Age                   28.000000
Water_Cement_Ratio     0.300000
Coarse_Fine_Ratio      1.560651
Name: 1, dtype: float64

SHAP values for the selected instance:
[-3.9569442  -2.21740219 -1.29945267  5.67020881 32.05044899  0.58360727]
In [94]:
# Assuming y is your target variable (e.g., a pandas Series or NumPy array)

# For pandas Series
real_y_value = y.iloc[instance_index]

# For NumPy array or list
# real_y_value = y[instance_index]

print(f"The actual target value for instance {instance_index} is: {real_y_value}")
The actual target value for instance 0 is: 61.89
In [95]:
import shap
import matplotlib.pyplot as plt

# Initialize the SHAP explainer using the trained Gradient Boosting model
explainer = shap.TreeExplainer(best_gb_reg)

# Calculate SHAP values for the dataset used with the model
shap_values = explainer.shap_values(X)

# Select an instance to visualize (you can change the index)
instance_index = 0  # For the first instance in the dataset

# Create a Force Plot for the selected instance
# (force_plot with matplotlib=True creates its own figure, so no plt.figure call is needed)
shap.initjs()  # Initialize JavaScript for rendering

# Use the force_plot function to visualize the SHAP values for the selected instance
shap.force_plot(explainer.expected_value, shap_values[instance_index], X.iloc[instance_index], matplotlib=True, show=False)

# Save the Force Plot to a PNG file
plt.savefig('force_plot.png', bbox_inches='tight')  # Save as a PNG file

# Save the Force Plot to an HTML file
shap.save_html('force_plot.html', shap.force_plot(explainer.expected_value, shap_values[instance_index], X.iloc[instance_index]))  # Save as an HTML file

# Show the plot (optional if you want to see it interactively)
plt.show()
[Figure: SHAP force plot for the selected instance]

Waterfall Plot¶

In [96]:
import shap
import matplotlib.pyplot as plt

# Initialize the SHAP explainer using the trained Gradient Boosting model
explainer = shap.Explainer(best_gb_reg, X)

# Calculate SHAP values for the dataset used with the model
shap_values = explainer(X, check_additivity=False)  # Disable additivity check

# Select an instance to visualize (you can change the index)
instance_index = 0  # For the first instance in the dataset

# Extract the SHAP values for the selected instance
shap_value = shap_values[instance_index]

# Create a Waterfall Plot for the selected instance
shap.plots.waterfall(shap_value, max_display=10, show=False)

# Save the Waterfall Plot to a PNG file
plt.savefig('waterfall_plot.png', bbox_inches='tight')  # Save as a PNG file

# Show the plot
plt.show()
[Figure: SHAP waterfall plot for the selected instance]

Dependence Plot¶

In [97]:
import shap
import matplotlib.pyplot as plt
import pandas as pd

# Ensure the test set is a DataFrame with feature names
# If X_test is already a DataFrame, this step can be skipped
feature_names = X.columns.tolist()  # 'X' is the original feature DataFrame
X_test_df = pd.DataFrame(X_test, columns=feature_names)

# Verify that 'Water_Cement_Ratio' exists in the DataFrame
if 'Water_Cement_Ratio' not in X_test_df.columns:
    raise ValueError("Feature 'Water_Cement_Ratio' not found in X_test.")

# Initialize the SHAP explainer using the trained Gradient Boosting model
explainer = shap.TreeExplainer(best_gb_reg)

# Calculate SHAP values for the scaled test set
shap_values = explainer.shap_values(X_test_df)

# Select a feature for the dependence plot
feature_name = 'Water_Cement_Ratio'  # Ensure this feature exists

# Create the dependence plot (shap.dependence_plot creates its own figure)
shap.dependence_plot(
    feature_name, 
    shap_values, 
    X_test_df, 
    feature_names=feature_names,  # Optional: explicitly pass feature names
    show=False  # Prevent immediate display
)

# Customize the title and show the plot
plt.title(f'Dependence Plot for {feature_name}', fontsize=16, fontweight='bold')
plt.show()
[Figure: SHAP dependence plot for Water_Cement_Ratio]
In [98]:
import shap
import matplotlib.pyplot as plt
import pandas as pd

# ==============================
# Step 1: Prepare the Scaled Test Data
# ==============================

# Assuming you have X_test available from your preprocessing steps
# and X is your original feature DataFrame.

# Replace 'X' with your actual original feature DataFrame variable if different
feature_names = X.columns.tolist()

# Convert X_test to a pandas DataFrame with feature names
X_test_df = pd.DataFrame(X_test, columns=feature_names)

# ==============================
# Step 2: Initialize the SHAP Explainer
# ==============================

# Initialize the SHAP TreeExplainer with the trained Gradient Boosting model
explainer = shap.TreeExplainer(best_gb_reg)

# ==============================
# Step 3: Calculate SHAP Values Using Scaled Data
# ==============================

# Calculate SHAP values for the scaled test dataset
# For regression problems, shap_values will be a 2D array (n_samples, n_features)
shap_values = explainer.shap_values(X_test_df)

# ==============================
# Step 4: Verify Feature Names
# ==============================

# Ensure that the feature you want to plot exists in the DataFrame
feature_to_plot = 'Water_Cement_Ratio'  # Change this to your desired feature

if feature_to_plot not in X_test_df.columns:
    raise ValueError(f"Feature '{feature_to_plot}' not found in the scaled test dataset.")

# ==============================
# Step 5: Create SHAP Dependence Plots for All Feature Pairs
# ==============================

# Get all feature names
features = X_test_df.columns
num_features = len(features)

# Initialize a figure with subplots
# Note: Creating a grid of dependence plots for every feature pair can be very intensive
# and may result in a cluttered visualization, especially with a large number of features.
# Consider focusing on a subset of important features for clarity.

fig, axs = plt.subplots(num_features, num_features, figsize=(20, 20))

# Iterate through each pair of features to create dependence plots
for i in range(num_features):
    for j in range(num_features):
        if i != j:
            # Create the dependence plot for feature i, colored by feature j
            shap.dependence_plot(
                features[i],                      # Feature to plot on the x-axis
                shap_values,                      # SHAP values
                X_test_df,                 # Dataset
                interaction_index=features[j],    # Feature to color by
                ax=axs[i, j],                     # Specific subplot axis
                show=False                         # Prevent immediate display
            )
        else:
            # Hide the diagonal plots where feature pairs are identical
            axs[i, j].axis('off')

# Adjust layout and add a super title
plt.suptitle('SHAP Dependence Plots for Feature Combinations', fontsize=20)
plt.tight_layout()
plt.subplots_adjust(top=0.95)  # Adjust the top to fit the super title

# Display the plots
plt.show()
[Figure: grid of SHAP dependence plots for all feature pairs]
In [99]:
import shap
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap  # Import ListedColormap
import numpy as np
import pandas as pd

# ==============================
# Step 1: Prepare the Scaled Test Data
# ==============================

# Assuming you have X_test available from your preprocessing steps
# and X is your original feature DataFrame.

# Replace 'X' with your actual original feature DataFrame variable if different
feature_names = X.columns.tolist()

# Convert X_test to a pandas DataFrame with feature names
X_test_df = pd.DataFrame(X_test, columns=feature_names)

# ==============================
# Step 2: Initialize the SHAP Explainer
# ==============================

# Initialize the SHAP TreeExplainer with the trained Gradient Boosting model
explainer = shap.TreeExplainer(best_gb_reg)

# ==============================
# Step 3: Calculate SHAP Values Using Scaled Data
# ==============================

# Calculate SHAP values for the scaled test dataset
# For regression problems, shap_values will be a 2D array (n_samples, n_features)
shap_values = explainer.shap_values(X_test_df)

# ==============================
# Step 4: Verify Feature Names and Handle SHAP Values
# ==============================

# Ensure that 'Water_Cement_Ratio' exists in the DataFrame if needed
feature_to_plot = 'Water_Cement_Ratio'  # Change this to your desired feature

if feature_to_plot not in X_test_df.columns:
    raise ValueError(f"Feature '{feature_to_plot}' not found in the scaled test dataset.")

# Note: The following condition is typically used for classification problems.
# For regression, shap_values should be a 2D array, so this check might not be necessary.
# However, keeping it ensures compatibility in case of any unexpected SHAP value structures.
if isinstance(shap_values, list):
    shap_values = shap_values[0]  # Use the SHAP values for the first class

# ==============================
# Step 5: Define a Custom Colormap
# ==============================

# Define a custom colormap with colors ordered from low to high values
# Adjust the colors as needed to represent your data effectively
colors_list = ['green', 'blue', 'orange', 'red']  # Reordered colors
cmap = ListedColormap(colors_list)

# ==============================
# Step 6: Create SHAP Dependence Plots for All Feature Pairs
# ==============================

# Get all feature names
features = X_test_df.columns
num_features = len(features)

# Initialize a figure with subplots
# Creating a grid of dependence plots for every feature pair can be very intensive
# and may result in a cluttered visualization, especially with a large number of features.
# Consider focusing on a subset of important features for clarity.
fig, axs = plt.subplots(num_features, num_features, figsize=(20, 20))

# Iterate through each pair of features to create dependence plots
for i in range(num_features):
    for j in range(num_features):
        if i != j:
            # Create the dependence plot for feature i, colored by feature j
            shap.dependence_plot(
                features[i],                      # Feature to plot on the x-axis
                shap_values,                      # SHAP values
                X_test_df,                 # Dataset
                interaction_index=features[j],    # Feature to color by
                ax=axs[i, j],                     # Specific subplot axis
                show=False,                       # Prevent immediate display
                cmap=cmap                         # Use the custom colormap
            )
        else:
            # Hide the diagonal plots where feature pairs are identical
            axs[i, j].axis('off')

# Adjust layout and add a super title
plt.suptitle('SHAP Dependence Plots for Feature Combinations', fontsize=20)
plt.tight_layout()
plt.subplots_adjust(top=0.95)  # Adjust the top to fit the super title

# Display the plots
plt.show()
[Figure: grid of SHAP dependence plots for all feature pairs, custom colormap]

Bar Plot¶

In [100]:
import shap
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# ==============================
# Step 1: Prepare the Scaled Test Data
# ==============================

# Assuming you have X_test available from your preprocessing steps
# and X is your original feature DataFrame.

# Replace 'X' with your actual original feature DataFrame variable if different
feature_names = X.columns.tolist()

# Convert X_test to a pandas DataFrame with feature names
X_test_df = pd.DataFrame(X_test, columns=feature_names)

# ==============================
# Step 2: Initialize the SHAP Explainer
# ==============================

# Initialize the SHAP TreeExplainer with the trained Gradient Boosting model
explainer = shap.TreeExplainer(best_gb_reg)

# ==============================
# Step 3: Calculate SHAP Values Using Scaled Data
# ==============================

# Calculate SHAP values for the scaled test dataset
# For regression problems, shap_values will be a 2D array (n_samples, n_features)
# For classification, it could be a list of arrays (one per class)
shap_values = explainer.shap_values(X_test_df)

# ==============================
# Step 4: Handle SHAP Values for Classification (If Applicable)
# ==============================

# If shap_values is a list (applicable for classification problems), select one class
if isinstance(shap_values, list):
    shap_values = shap_values[0]  # Use the SHAP values for the first class

# ==============================
# Step 5: Compute Mean Absolute SHAP Values (Optional)
# ==============================

# Compute the mean absolute SHAP values for each feature (optional)
# This step is optional as shap.summary_plot with plot_type="bar" already computes and plots mean absolute SHAP values
mean_shap_values = np.abs(shap_values).mean(axis=0)

# ==============================
# Step 6: Create SHAP Summary Bar Plot
# ==============================

# Create a bar plot for the mean absolute SHAP values
plt.figure(figsize=(10, 6))
shap.summary_plot(
    shap_values,
    X_test_df,
    plot_type="bar",
    show=False  # Prevent immediate display to allow for customization
)

# ==============================
# Step 7: Customize and Display the Plot
# ==============================

# Customize the plot (optional)
plt.title('SHAP Summary Bar Plot for Gradient Boosting Model', fontsize=16, fontweight='bold')
plt.xlabel('Mean |SHAP value|', fontsize=14)
plt.ylabel('Features', fontsize=14)

# Show the plot
plt.show()
[Figure: SHAP summary bar plot (mean |SHAP value| per feature)]

Partial Dependence Plot¶

In [101]:
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt
import pandas as pd

# ==============================
# Step 1: Prepare the Scaled Test Data
# ==============================

# Assuming you have X_test available from your preprocessing steps
# and X is your original feature DataFrame.

# Replace 'X' with your actual original feature DataFrame variable if different
feature_names = X.columns.tolist()

# Convert X_test to a pandas DataFrame with feature names
X_test_df = pd.DataFrame(X_test, columns=feature_names)

# ==============================
# Step 2: Initialize the Partial Dependence Display
# ==============================

# Define the features to inspect
features_to_inspect = ['Water_Cement_Ratio', 'Age']  # Single feature or a pair of features

# Initialize a figure for the plot
fig, ax = plt.subplots(figsize=(12, 8))

# ==============================
# Step 3: Generate Partial Dependence Plots Using Scaled Data
# ==============================

# Create the Partial Dependence Plot using the test dataset
PartialDependenceDisplay.from_estimator(
    best_gb_reg,          # Your trained Gradient Boosting model
    X_test_df,            # Test set with feature names, consistent with training
    features=features_to_inspect,  # Features to plot
    ax=ax,                # Axes object to plot on
    grid_resolution=50    # Number of points on the x-axis grid
)

# ==============================
# Step 4: Enhance Plot Aesthetics
# ==============================

# Customize the plot title and labels
ax.set_title('Partial Dependence Plot', fontsize=16)
ax.set_xlabel('Feature Value', fontsize=14)
ax.set_ylabel('Partial Dependence', fontsize=14)

# Adjust layout to prevent overlap with the title
plt.subplots_adjust(top=0.9)

# ==============================
# Step 5: Display the Plot
# ==============================

plt.show()
[Figure: partial dependence plots for Water_Cement_Ratio and Age]
In [102]:
# Print column names of the original DataFrame
print("Original DataFrame column names:")
print(X.columns.tolist())
Original DataFrame column names:
['Blast Furnace Slag', 'Fly Ash', 'Superplasticizer', 'Age', 'Water_Cement_Ratio', 'Coarse_Fine_Ratio']
In [103]:
# Print column names of the test DataFrame
print("\nTest DataFrame column names:")
print(X_test_df.columns.tolist())
Test DataFrame column names:
['Blast Furnace Slag', 'Fly Ash', 'Superplasticizer', 'Age', 'Water_Cement_Ratio', 'Coarse_Fine_Ratio']
In [104]:
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# Features to inspect (as a tuple for interaction)
features = [('Water_Cement_Ratio', 'Age')]

# Adjust percentiles to include the entire range
display = PartialDependenceDisplay.from_estimator(
    best_gb_reg,        # Replace with your Gradient Boosting model
    X,
    features=features,
    grid_resolution=50,
    percentiles=(0, 1),  # Include the whole range of both features
    kind='average',      # 'average' for regression models
)

# Enhancing the plot aesthetics
plt.suptitle('Partial Dependence Plot (Interaction)', fontsize=16)
plt.subplots_adjust(top=0.9)

plt.show()
[Figure: two-way partial dependence (interaction) plot for Water_Cement_Ratio and Age]

Gradient Boosting Regressor Feature Ranking¶

In [105]:
# Verify the features expected by the model and available in X
print("Model trained with n_features:", best_gb_reg.n_features_in_)
print("Features in X:", X.shape[1])
print("Columns in X:", X.columns)
Model trained with n_features: 6
Features in X: 6
Columns in X: Index(['Blast Furnace Slag', 'Fly Ash', 'Superplasticizer', 'Age',
       'Water_Cement_Ratio', 'Coarse_Fine_Ratio'],
      dtype='object')
In [106]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Assuming best_gb_reg is your trained Gradient Boosting Regressor
feature_importances = best_gb_reg.feature_importances_

# Create a pandas series with feature names and their importance
features = pd.Series(feature_importances, index=X.columns)

# Sort the features based on importance
features_sorted = features.sort_values(ascending=True)

# Plotting
plt.figure(figsize=(10, 6))
features_sorted.plot(kind='barh')  # Use 'barh' for horizontal bar plot
plt.title('Feature Importance Ranking')
plt.xlabel('Importance')
plt.ylabel('Features')
plt.show()

# Optionally, print the ranked features
print("Ranked Features:\n", features_sorted)
[Figure: feature importance ranking bar plot]
Ranked Features:
 Fly Ash               0.030412
Coarse_Fine_Ratio     0.049814
Superplasticizer      0.088205
Blast Furnace Slag    0.131163
Age                   0.270555
Water_Cement_Ratio    0.429851
dtype: float64
In [107]:
# Verify that X_test has the same number of features as expected by the model
print("Features expected by model:", best_gb_reg.n_features_in_)
print("Features in X_test:", X_test.shape[1])
Features expected by model: 6
Features in X_test: 6

Optimization Report for Gradient Boosting Regressor¶

Objective¶

The goal was to optimize the hyperparameters of a Gradient Boosting Regressor to minimize the Mean Squared Error (MSE) on a validation dataset through Bayesian optimization.

Methodology¶

  • Model: Gradient Boosting Regressor

  • Hyperparameters Tuned:

    • n_estimators: Number of boosting stages to perform. More stages increase the model's complexity.
    • learning_rate: Shrinks the contribution of each tree by the learning rate.
    • max_depth: Maximum depth of the individual regression estimators.
    • subsample: The fraction of samples to be used for fitting the individual base learners.
    • min_samples_split: The minimum number of samples required to split an internal node.
  • Optimization Technique: Bayesian Optimization using Gaussian Processes.
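To make the role of n_estimators and learning_rate concrete, the sketch below is illustrative only: it uses synthetic data from make_regression (not the concrete dataset) and generic hyperparameter values, and tracks the test MSE after each boosting stage with staged_predict, showing how each tree's shrunken contribution accumulates.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the concrete data (illustration only)
X_demo, y_demo = make_regression(n_samples=400, n_features=6, noise=10.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                               max_depth=3, subsample=0.8, random_state=42)
gb.fit(X_tr, y_tr)

# staged_predict yields the model's prediction after each boosting stage,
# so the error trajectory shows each tree's (shrunken) contribution accumulating
stage_mse = [mean_squared_error(y_te, pred) for pred in gb.staged_predict(X_te)]
print(f"Test MSE after 1 stage:    {stage_mse[0]:.1f}")
print(f"Test MSE after 200 stages: {stage_mse[-1]:.1f}")
```

Plotting stage_mse against the stage index is also a quick way to spot the point where additional estimators stop paying off.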

Results¶

  • Best Hyperparameters:

    • n_estimators: 500
    • learning_rate: 0.206
    • max_depth: 10
    • subsample: 0.50
    • min_samples_split: 0.242
  • Performance:

    • Best MSE (Cross-validation): 18.2414
    • MSE on Test Set: 15.7326
    • R-squared on Test Set: 0.9397
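The reported configuration can be re-instantiated directly. Below is a minimal sketch with synthetic data standing in for the notebook's X_train_scaled/y_train; note that passing min_samples_split as a float makes scikit-learn interpret it as a fraction of the training samples.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

# Best hyperparameters from the Bayesian search above
best_params = dict(n_estimators=500, learning_rate=0.206, max_depth=10,
                   subsample=0.50, min_samples_split=0.242)  # float => fraction of samples

model = GradientBoostingRegressor(**best_params, random_state=42)

# Synthetic stand-in; in the notebook this would be X_train_scaled, y_train
X_demo, y_demo = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)
model.fit(X_demo, y_demo)
print(f"Train R^2: {model.score(X_demo, y_demo):.3f}")
```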

Conclusion¶

The optimized Gradient Boosting Regressor demonstrates a high degree of predictive accuracy, as evidenced by the low MSE and high R-squared value on the test set. This indicates a strong fit to the data, capable of capturing most of the variability in the target variable. The results suggest that the model, with the optimized parameters, could be highly effective for predictive tasks within this domain.

Future Steps¶

Further testing with external datasets and potential integration into a production pipeline could be considered to validate the model's effectiveness in operational settings. Additionally, experimenting with other forms of ensemble learning could provide comparative insights into model performance and robustness.

Distribution of Residuals Analysis¶

In [108]:
import seaborn as sns
import matplotlib.pyplot as plt

# Predict with the DataFrame directly so the feature names match those used during training
y_pred_gb = best_gb_reg.predict(X_test)

# Calculate residuals
residuals_gb = y_test - y_pred_gb

# Plot the residual distribution
sns.histplot(residuals_gb, kde=True, color='blue')  # Added color for distinction
plt.xlabel('Residuals')
plt.ylabel('Density')
plt.title('Density Plot of Residuals for Gradient Boosting Regressor')
plt.show()
[Figure: density plot of residuals for the Gradient Boosting Regressor]

Insights from the Residuals Density Plot¶

  • The residuals (actual minus predicted) are skewed, with the bulk sitting slightly below 0, indicating a mild tendency of the model to over-predict.
  • Most residuals fall between -10 and 10, consistent with the test RMSE of roughly 4 MPa, so typical errors are small relative to the strength values.
  • The distribution is not perfectly normal, as evidenced by the skew towards negative residuals.
  • There are some larger positive residuals, indicating that the model occasionally under-predicts substantially, though these instances are fewer.
  • The overall shape of the plot suggests the model may benefit from further tuning to reduce prediction bias.
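The visual impressions above can be quantified with a few summary statistics. A sketch follows; it assumes residuals_gb from the residuals cell, with a small synthetic array standing in here for runnability.

```python
import numpy as np

def residual_bias_summary(residuals):
    """Summarize the sign and spread of residuals (actual - predicted)."""
    res = np.asarray(residuals, dtype=float)
    return {
        "mean": float(res.mean()),                  # < 0 => model over-predicts on average
        "median": float(np.median(res)),
        "share_negative": float((res < 0).mean()),  # fraction of over-predictions
        "std": float(res.std(ddof=1)),
    }

# Synthetic stand-in for residuals_gb
demo_residuals = np.array([-4.2, -3.1, -2.5, -1.0, -0.3, 0.4, 1.1, 2.0, 6.5])
summary = residual_bias_summary(demo_residuals)
print(summary)
```

A mean or median well below zero, combined with a share_negative above 0.5, would confirm the systematic over-prediction suggested by the density plot.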

AdaBoost Regressor for Regression Task¶

In [109]:
from sklearn.ensemble import AdaBoostRegressor

Training AdaBoost Regressor with Decision Tree Base Estimator¶

In [110]:
# Assuming X_train_scaled and X_test_scaled are already scaled pandas DataFrames
# Initialize Decision Tree Regressor (as the base estimator for AdaBoost)
base_estimator = DecisionTreeRegressor(max_depth=8)

# Initialize AdaBoost Regressor
adaboost_reg = AdaBoostRegressor(estimator=base_estimator, n_estimators=100, learning_rate=0.01, random_state=42)

# Ensure you pass pandas DataFrames with column names to fit the model
adaboost_reg.fit(X_train_scaled, y_train)

# Make predictions on the scaled test set (ensure it's a DataFrame too)
y_pred_ada_regressor = adaboost_reg.predict(X_test_scaled)

Evaluating AdaBoost Regressor: Mean Squared Error and Actual vs. Predicted Values Comparison¶

In [111]:
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Evaluate Mean Squared Error (MSE)
mse_ada_regressor = mean_squared_error(y_test, y_pred_ada_regressor)
r2_ada_regressor = r2_score(y_test, y_pred_ada_regressor)
print(f"Mean Squared Error (AdaBoost Regressor): {mse_ada_regressor}")
print(f"R-squared (AdaBoost Regressor): {r2_ada_regressor}")
Mean Squared Error (AdaBoost Regressor): 27.19252134620052
R-squared (AdaBoost Regressor): 0.8846838676973559
In [112]:
from skopt import gp_minimize
from skopt.space import Real, Integer
from skopt.utils import use_named_args
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import cross_val_score
import numpy as np

# Define the space of hyperparameters to search
space = [
    Integer(50, 500, name='n_estimators'),  # Number of trees
    Real(0.01, 0.5, name='learning_rate'),  # Learning rate
    Integer(1, 10, name='max_depth')  # Max depth of the base estimator (Decision Tree)
]

# Objective function to minimize
@use_named_args(space)
def objective(**params):
    # Initialize the base estimator (Decision Tree) with variable max depth
    base_estimator = DecisionTreeRegressor(max_depth=params['max_depth'])
    
    # Initialize AdaBoost Regressor with the current set of parameters
    adaboost_reg = AdaBoostRegressor(
        estimator=base_estimator, 
        n_estimators=params['n_estimators'], 
        learning_rate=params['learning_rate'], 
        random_state=42
    )
    
    # Use negative mean squared error as the score to minimize
    return -np.mean(cross_val_score(adaboost_reg, X_train_scaled, y_train, cv=5, n_jobs=-1, scoring="neg_mean_squared_error"))

# Run Bayesian optimization
result = gp_minimize(objective, space, n_calls=50, random_state=42)

# Print the best parameters found
print("Best parameters:", result.x)
print("Best CV MSE:", result.fun)  # the objective already returns a positive MSE

# Optionally, fit the model with the best parameters and assign it to `best_adaboost_reg`
best_base_estimator = DecisionTreeRegressor(max_depth=result.x[2])

best_adaboost_reg = AdaBoostRegressor(
    estimator=best_base_estimator,
    n_estimators=result.x[0],
    learning_rate=result.x[1],
    random_state=42
)

# Fit the model to the training data
best_adaboost_reg.fit(X_train_scaled, y_train)

# Make predictions on the test set and calculate R-squared
y_pred_best_ada = best_adaboost_reg.predict(X_test_scaled)
mse_best_ada = mean_squared_error(y_test, y_pred_best_ada)
r2_best_ada = r2_score(y_test, y_pred_best_ada)

print(f"Mean Squared Error (Best AdaBoost Regressor): {mse_best_ada}")
print(f"R-squared (Best AdaBoost Regressor): {r2_best_ada}")
Best parameters: [220, 0.15189014673544077, 10]
Best CV MSE: 33.329804704724594
Mean Squared Error (Best AdaBoost Regressor): 23.27260223412654
R-squared (Best AdaBoost Regressor): 0.9013071850127524
In [113]:
# Make predictions with the best AdaBoost model
y_pred_best_adaboost = best_adaboost_reg.predict(X_test_scaled)

# Plot the predicted vs. actual values
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred_best_adaboost, color='blue', label='Predicted Values')
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', label='Ideal Line')
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values (Best AdaBoost Regressor)')
plt.legend()
plt.grid(True)
plt.show()
[Figure: actual vs. predicted values for the best AdaBoost Regressor]

The model shows strong predictive ability, although in some instances, particularly in the mid-range of actual values, the predictions are less accurate, as shown by the spread of points around the ideal line. Overall, the AdaBoost Regressor performs effectively on this regression task.

Best AdaBoost Regressor Results¶

After performing hyperparameter optimization using Bayesian optimization, the AdaBoost Regressor yielded the following best parameters:

  • Number of Estimators: 220
  • Learning Rate: 0.1519
  • Max Depth of Base Estimator: 10

Performance Metrics:

  • Mean Squared Error (MSE): 23.273
  • R-squared: 0.901

The following plot compares the actual vs. predicted values using the best AdaBoost Regressor model:

This plot shows a strong correlation between the actual and predicted values, indicating that the model fits the data well: with an R-squared of 0.901, it explains approximately 90.1% of the variance in the target variable.

Importing TensorFlow and Keras for Deep Learning¶

In [114]:
import tensorflow as tf
from tensorflow import keras

Defining and Training a Neural Network Model for Regression¶

In [115]:
import tensorflow as tf

# Check if TensorFlow is set to use a GPU
physical_devices = tf.config.list_physical_devices('GPU')

if len(physical_devices) > 0:
    print("TensorFlow is using the GPU:")
    print(physical_devices)
else:
    print("TensorFlow is not using the GPU.")
TensorFlow is using the GPU:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]
In [116]:
import tensorflow as tf
from tensorflow import keras
from keras.layers import Dense, Dropout, BatchNormalization
from keras.regularizers import l2

# Define the neural network model with more layers, regularization, and dropout
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(X_train_scaled.shape[1],),
                       kernel_regularizer=l2(0.01)),  # First layer with regularization
    keras.layers.BatchNormalization(),  # Batch Normalization
    keras.layers.Dropout(0.3),  # Dropout to prevent overfitting
    keras.layers.Dense(64, activation='relu', kernel_regularizer=l2(0.01)),  # Second layer with L2 regularization
    keras.layers.BatchNormalization(),  # Batch Normalization
    keras.layers.Dropout(0.3),  # Dropout to prevent overfitting
    keras.layers.Dense(32, activation='relu', kernel_regularizer=l2(0.01)),  # Third layer
    keras.layers.Dense(1)  # Output layer with 1 neuron for regression
])

# Compile the model
model.compile(optimizer='RMSprop', loss='mean_squared_error')

# Train the model
history = model.fit(X_train_scaled, y_train, epochs=100, batch_size=32, 
                    validation_data=(X_test_scaled, y_test), verbose=2)

# Make predictions on the test set
y_pred_nn = model.predict(X_test_scaled).flatten()
Epoch 1/100
20/20 - 3s - 161ms/step - loss: 1223.8672 - val_loss: 1283.0314
Epoch 2/100
20/20 - 0s - 19ms/step - loss: 1075.8778 - val_loss: 1244.4874
Epoch 3/100
20/20 - 0s - 19ms/step - loss: 928.7714 - val_loss: 1190.0981
Epoch 4/100
20/20 - 0s - 19ms/step - loss: 791.4333 - val_loss: 1112.4110
Epoch 5/100
20/20 - 0s - 18ms/step - loss: 652.7668 - val_loss: 1009.7191
Epoch 6/100
20/20 - 0s - 18ms/step - loss: 521.7047 - val_loss: 878.0530
Epoch 7/100
20/20 - 0s - 19ms/step - loss: 387.1394 - val_loss: 733.8263
Epoch 8/100
20/20 - 0s - 18ms/step - loss: 288.5801 - val_loss: 584.7815
Epoch 9/100
20/20 - 0s - 18ms/step - loss: 211.6013 - val_loss: 447.3894
Epoch 10/100
20/20 - 0s - 19ms/step - loss: 138.7987 - val_loss: 351.2259
Epoch 11/100
20/20 - 0s - 18ms/step - loss: 106.3824 - val_loss: 294.7450
Epoch 12/100
20/20 - 0s - 18ms/step - loss: 101.1113 - val_loss: 249.9865
Epoch 13/100
20/20 - 0s - 18ms/step - loss: 80.7331 - val_loss: 209.0912
Epoch 14/100
20/20 - 0s - 18ms/step - loss: 83.7363 - val_loss: 199.1244
Epoch 15/100
20/20 - 0s - 19ms/step - loss: 83.9023 - val_loss: 180.7177
Epoch 16/100
20/20 - 0s - 18ms/step - loss: 83.7553 - val_loss: 167.3735
Epoch 17/100
20/20 - 0s - 19ms/step - loss: 84.5573 - val_loss: 170.5496
Epoch 18/100
20/20 - 0s - 19ms/step - loss: 84.0364 - val_loss: 153.8028
Epoch 19/100
20/20 - 0s - 18ms/step - loss: 73.3295 - val_loss: 146.6174
Epoch 20/100
20/20 - 0s - 18ms/step - loss: 70.3517 - val_loss: 150.3872
Epoch 21/100
20/20 - 0s - 18ms/step - loss: 75.1052 - val_loss: 132.7004
Epoch 22/100
20/20 - 0s - 18ms/step - loss: 66.3796 - val_loss: 120.5198
Epoch 23/100
20/20 - 0s - 18ms/step - loss: 71.0013 - val_loss: 100.9101
Epoch 24/100
20/20 - 0s - 18ms/step - loss: 68.4097 - val_loss: 97.5618
Epoch 25/100
20/20 - 0s - 18ms/step - loss: 70.4435 - val_loss: 94.3384
Epoch 26/100
20/20 - 0s - 18ms/step - loss: 66.9766 - val_loss: 70.0855
Epoch 27/100
20/20 - 0s - 19ms/step - loss: 70.2837 - val_loss: 68.5814
Epoch 28/100
20/20 - 0s - 19ms/step - loss: 69.9031 - val_loss: 63.4971
Epoch 29/100
20/20 - 0s - 19ms/step - loss: 67.7673 - val_loss: 60.8022
Epoch 30/100
20/20 - 0s - 18ms/step - loss: 56.6530 - val_loss: 54.1987
Epoch 31/100
20/20 - 0s - 19ms/step - loss: 68.8400 - val_loss: 50.3610
Epoch 32/100
20/20 - 0s - 19ms/step - loss: 64.1201 - val_loss: 51.3203
Epoch 33/100
20/20 - 0s - 18ms/step - loss: 68.8209 - val_loss: 49.6247
Epoch 34/100
20/20 - 0s - 18ms/step - loss: 63.6532 - val_loss: 46.7988
Epoch 35/100
20/20 - 0s - 18ms/step - loss: 60.5920 - val_loss: 45.9145
Epoch 36/100
20/20 - 0s - 18ms/step - loss: 62.8280 - val_loss: 43.3049
Epoch 37/100
20/20 - 0s - 19ms/step - loss: 62.9616 - val_loss: 43.9976
Epoch 38/100
20/20 - 0s - 19ms/step - loss: 63.6009 - val_loss: 43.9492
Epoch 39/100
20/20 - 0s - 18ms/step - loss: 59.1446 - val_loss: 40.4458
Epoch 40/100
20/20 - 0s - 18ms/step - loss: 60.8890 - val_loss: 40.2989
Epoch 41/100
20/20 - 0s - 18ms/step - loss: 67.2839 - val_loss: 38.9917
Epoch 42/100
20/20 - 0s - 18ms/step - loss: 59.5490 - val_loss: 41.1788
Epoch 43/100
20/20 - 0s - 18ms/step - loss: 59.3070 - val_loss: 39.8945
Epoch 44/100
20/20 - 0s - 18ms/step - loss: 66.5727 - val_loss: 41.9251
Epoch 45/100
20/20 - 0s - 18ms/step - loss: 60.8545 - val_loss: 43.6530
Epoch 46/100
20/20 - 0s - 18ms/step - loss: 55.8235 - val_loss: 39.7848
Epoch 47/100
20/20 - 0s - 18ms/step - loss: 61.2517 - val_loss: 38.7041
Epoch 48/100
20/20 - 0s - 18ms/step - loss: 54.2906 - val_loss: 40.8614
Epoch 49/100
20/20 - 0s - 19ms/step - loss: 54.8401 - val_loss: 38.0811
Epoch 50/100
20/20 - 0s - 18ms/step - loss: 56.4312 - val_loss: 39.9041
Epoch 51/100
20/20 - 0s - 18ms/step - loss: 65.0337 - val_loss: 38.1612
Epoch 52/100
20/20 - 0s - 18ms/step - loss: 61.2916 - val_loss: 39.7324
Epoch 53/100
20/20 - 0s - 18ms/step - loss: 47.7296 - val_loss: 37.5167
Epoch 54/100
20/20 - 0s - 18ms/step - loss: 58.1125 - val_loss: 37.3660
Epoch 55/100
20/20 - 0s - 18ms/step - loss: 58.8441 - val_loss: 38.9918
Epoch 56/100
20/20 - 0s - 18ms/step - loss: 58.3459 - val_loss: 35.1745
Epoch 57/100
20/20 - 0s - 18ms/step - loss: 57.8358 - val_loss: 38.1140
Epoch 58/100
20/20 - 0s - 18ms/step - loss: 59.7262 - val_loss: 41.9825
Epoch 59/100
20/20 - 0s - 18ms/step - loss: 61.1750 - val_loss: 40.4314
Epoch 60/100
20/20 - 0s - 19ms/step - loss: 54.7296 - val_loss: 38.7727
Epoch 61/100
20/20 - 0s - 18ms/step - loss: 68.8567 - val_loss: 36.3450
Epoch 62/100
20/20 - 0s - 18ms/step - loss: 58.5639 - val_loss: 37.0296
Epoch 63/100
20/20 - 0s - 19ms/step - loss: 58.2237 - val_loss: 37.9746
Epoch 64/100
20/20 - 0s - 18ms/step - loss: 56.1245 - val_loss: 37.2614
Epoch 65/100
20/20 - 0s - 18ms/step - loss: 55.0965 - val_loss: 36.8966
Epoch 66/100
20/20 - 0s - 18ms/step - loss: 52.0645 - val_loss: 36.2701
Epoch 67/100
20/20 - 0s - 18ms/step - loss: 59.8736 - val_loss: 35.3764
Epoch 68/100
20/20 - 0s - 19ms/step - loss: 52.5075 - val_loss: 38.3656
Epoch 69/100
20/20 - 0s - 19ms/step - loss: 54.6361 - val_loss: 37.3373
Epoch 70/100
20/20 - 0s - 18ms/step - loss: 49.8718 - val_loss: 36.1817
Epoch 71/100
20/20 - 0s - 19ms/step - loss: 59.8410 - val_loss: 36.1870
Epoch 72/100
20/20 - 0s - 18ms/step - loss: 59.7253 - val_loss: 34.1899
Epoch 73/100
20/20 - 0s - 18ms/step - loss: 53.5717 - val_loss: 36.8985
Epoch 74/100
20/20 - 0s - 18ms/step - loss: 56.2878 - val_loss: 36.4587
Epoch 75/100
20/20 - 0s - 18ms/step - loss: 56.6005 - val_loss: 41.4631
Epoch 76/100
20/20 - 0s - 18ms/step - loss: 47.2758 - val_loss: 36.8640
Epoch 77/100
20/20 - 0s - 18ms/step - loss: 54.5601 - val_loss: 38.1833
Epoch 78/100
20/20 - 0s - 19ms/step - loss: 51.0978 - val_loss: 34.0757
Epoch 79/100
20/20 - 0s - 18ms/step - loss: 52.5273 - val_loss: 34.5126
Epoch 80/100
20/20 - 0s - 18ms/step - loss: 54.1102 - val_loss: 34.2020
Epoch 81/100
20/20 - 0s - 19ms/step - loss: 52.0284 - val_loss: 34.8168
Epoch 82/100
20/20 - 0s - 18ms/step - loss: 55.0689 - val_loss: 35.0836
Epoch 83/100
20/20 - 0s - 18ms/step - loss: 47.8703 - val_loss: 36.1071
Epoch 84/100
20/20 - 0s - 18ms/step - loss: 56.3722 - val_loss: 36.6971
Epoch 85/100
20/20 - 0s - 18ms/step - loss: 55.3300 - val_loss: 36.4554
Epoch 86/100
20/20 - 0s - 18ms/step - loss: 50.8283 - val_loss: 34.7858
Epoch 87/100
20/20 - 0s - 18ms/step - loss: 53.3829 - val_loss: 38.2549
Epoch 88/100
20/20 - 0s - 18ms/step - loss: 51.5791 - val_loss: 38.3046
Epoch 89/100
20/20 - 0s - 18ms/step - loss: 48.3577 - val_loss: 35.0795
Epoch 90/100
20/20 - 0s - 18ms/step - loss: 50.8030 - val_loss: 33.3665
Epoch 91/100
20/20 - 0s - 18ms/step - loss: 50.0902 - val_loss: 33.2820
Epoch 92/100
20/20 - 0s - 18ms/step - loss: 56.4945 - val_loss: 34.1334
Epoch 93/100
20/20 - 0s - 18ms/step - loss: 50.3872 - val_loss: 38.2138
Epoch 94/100
20/20 - 0s - 19ms/step - loss: 46.4333 - val_loss: 34.5624
Epoch 95/100
20/20 - 0s - 18ms/step - loss: 51.2324 - val_loss: 36.8812
Epoch 96/100
20/20 - 0s - 18ms/step - loss: 54.9677 - val_loss: 31.7936
Epoch 97/100
20/20 - 0s - 18ms/step - loss: 52.5649 - val_loss: 33.0703
Epoch 98/100
20/20 - 0s - 19ms/step - loss: 55.3423 - val_loss: 35.5865
Epoch 99/100
20/20 - 0s - 19ms/step - loss: 52.2816 - val_loss: 33.9299
Epoch 100/100
20/20 - 0s - 19ms/step - loss: 47.8800 - val_loss: 32.5586
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step

Evaluation and Visualization of Neural Network Model¶

In [117]:
# Evaluate Mean Squared Error (MSE) and R-Squared
mse_nn = mean_squared_error(y_test, y_pred_nn)
r2_nn = r2_score(y_test, y_pred_nn)
print(f"Mean Squared Error (Neural Network): {mse_nn}")
print(f"R-squared (Neural Network): {r2_nn}")

# Plot the training loss and validation loss over epochs
plt.figure(figsize=(8, 6))
plt.plot(history.history['loss'], label='Training Loss', color='blue', linewidth=2)
plt.plot(history.history['val_loss'], label='Validation Loss', color='orange', linestyle='--', linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Mean Squared Error', fontsize=12)
plt.title('Training and Validation Loss', fontsize=14, fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True)
plt.show()

# Plot the predicted vs. actual values for neural network predictions
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred_nn, color='blue', label='Predicted Values', alpha=0.7)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', label='Ideal Line', linewidth=2)
plt.xlabel('Actual Values', fontsize=12)
plt.ylabel('Predicted Values', fontsize=12)
plt.title('Actual vs. Predicted Values (Neural Network)', fontsize=14, fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True)
plt.show()
Mean Squared Error (Neural Network): 30.703500822995213
R-squared (Neural Network): 0.8697947528299514
[Figure: Training and Validation Loss]
[Figure: Actual vs. Predicted Values (Neural Network)]
In [118]:
# import keras_tuner as kt

# def build_model(hp):
#     model = keras.Sequential()
    
#     # Define the first layer with dynamic number of neurons and regularization
#     model.add(keras.layers.Dense(units=hp.Int('units_1', min_value=32, max_value=256, step=32),
#                                  activation='relu',
#                                  input_shape=(X_train_scaled.shape[1],),
#                                  kernel_regularizer=l2(hp.Float('l2_1', 1e-4, 1e-2, sampling='log'))))
    
#     model.add(keras.layers.BatchNormalization())
#     model.add(keras.layers.Dropout(rate=hp.Float('dropout_1', min_value=0.2, max_value=0.5, step=0.1)))

#     # Define additional layers with tunable hyperparameters
#     for i in range(hp.Int('num_layers', 1, 3)):
#         model.add(keras.layers.Dense(units=hp.Int(f'units_{i+2}', min_value=32, max_value=256, step=32),
#                                      activation='relu',
#                                      kernel_regularizer=l2(hp.Float(f'l2_{i+2}', 1e-4, 1e-2, sampling='log'))))
#         model.add(keras.layers.BatchNormalization())
#         model.add(keras.layers.Dropout(rate=hp.Float(f'dropout_{i+2}', min_value=0.2, max_value=0.5, step=0.1)))

#     model.add(keras.layers.Dense(1))  # Output layer for regression

#     # Compile the model with a dynamic learning rate
#     model.compile(optimizer=keras.optimizers.Adam(
#         hp.Float('learning_rate', min_value=1e-4, max_value=1e-2, sampling='log')),
#         loss='mean_squared_error')
    
#     return model

# # Initialize the tuner with Bayesian Optimization
# tuner = kt.BayesianOptimization(
#     build_model,
#     objective='val_loss',
#     max_trials=20,
#     directory='bayesian_tuning',
#     project_name='neural_network_regression'
# )

# # Start the search for the best hyperparameters
# tuner.search(X_train_scaled, y_train, epochs=50, batch_size=32, validation_data=(X_test_scaled, y_test), verbose=2)

# # Get the best model
# best_hyperparameters = tuner.get_best_hyperparameters(1)[0]
# best_model = tuner.get_best_models(1)[0]

# # Fit the best model
# history = best_model.fit(X_train_scaled, y_train, epochs=100, batch_size=32, 
#                          validation_data=(X_test_scaled, y_test), verbose=2)

# # Make predictions
# y_pred_best_nn = best_model.predict(X_test_scaled).flatten()

# # Evaluate model performance
# mse_best_nn = mean_squared_error(y_test, y_pred_best_nn)
# r2_best_nn = r2_score(y_test, y_pred_best_nn)

# print(f"Best Model Mean Squared Error: {mse_best_nn}")
# print(f"Best Model R-squared: {r2_best_nn}")

Evaluation and Visualization of Neural Network Model¶

In [119]:
# # Retrieve the best hyperparameters
# best_hyperparameters = tuner.get_best_hyperparameters(1)[0]

# # Print the hyperparameters
# print("Best Hyperparameters:")
# for param, value in best_hyperparameters.values.items():
#     print(f"{param}: {value}")

Best Hyperparameters¶

  • units_1: 256
  • l2_1: 0.0004417575248265391
  • dropout_1: 0.30
  • num_layers: 1
  • units_2: 192
  • l2_2: 0.0005914405187452458
  • dropout_2: 0.2
  • learning_rate: 0.0013279102669134521
In [120]:
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score

# Define the model using the best hyperparameters
best_model = keras.Sequential([
    keras.Input(shape=(X_train_scaled.shape[1],)),  # Explicit Input layer (avoids the input_shape deprecation warning)
    keras.layers.Dense(256, activation='relu',
                       kernel_regularizer=keras.regularizers.l2(0.0004417575248265391)),  # First layer
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.3),  # Dropout for first layer
    keras.layers.Dense(192, activation='relu', kernel_regularizer=keras.regularizers.l2(0.0005914405187452458)),  # Second layer
    keras.layers.BatchNormalization(),
    keras.layers.Dropout(0.2),  # Dropout for second layer
    keras.layers.Dense(1)  # Output layer for regression
])

# Compile the model using the best learning rate
best_model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.0013279102669134521),
                   loss='mean_squared_error')

# Train the model
history = best_model.fit(X_train_scaled, y_train, epochs=100, batch_size=32, 
                         validation_data=(X_test_scaled, y_test), verbose=2)

# Make predictions
y_pred_best_nn = best_model.predict(X_test_scaled).flatten()

# Evaluate model performance
mse_best_nn = mean_squared_error(y_test, y_pred_best_nn)
r2_best_nn = r2_score(y_test, y_pred_best_nn)

print(f"Best Model Mean Squared Error: {mse_best_nn}")
print(f"Best Model R-squared: {r2_best_nn}")

# Plot the training loss and validation loss over epochs
plt.figure(figsize=(8, 6))
plt.plot(history.history['loss'], label='Training Loss', color='blue', linewidth=2)
plt.plot(history.history['val_loss'], label='Validation Loss', color='orange', linestyle='--', linewidth=2)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('Mean Squared Error', fontsize=12)
plt.title('Training and Validation Loss', fontsize=14, fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True)
plt.show()

# Plot the predicted vs. actual values for neural network predictions
plt.figure(figsize=(8, 6))
plt.scatter(y_test, y_pred_best_nn, color='blue', label='Predicted Values', alpha=0.7)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red', label='Ideal Line', linewidth=2)
plt.xlabel('Actual Values', fontsize=12)
plt.ylabel('Predicted Values', fontsize=12)
plt.title('Actual vs. Predicted Values (Neural Network)', fontsize=14, fontweight='bold')
plt.legend(fontsize=12)
plt.grid(True)
plt.show()
Epoch 1/100
20/20 - 3s - 168ms/step - loss: 1140.7561 - val_loss: 1250.9480
Epoch 2/100
20/20 - 0s - 24ms/step - loss: 1037.6497 - val_loss: 1200.1057
Epoch 3/100
20/20 - 0s - 24ms/step - loss: 989.3202 - val_loss: 1160.0198
Epoch 4/100
20/20 - 0s - 24ms/step - loss: 926.7775 - val_loss: 1101.7352
Epoch 5/100
20/20 - 0s - 24ms/step - loss: 855.4836 - val_loss: 1017.2596
Epoch 6/100
20/20 - 0s - 24ms/step - loss: 774.5211 - val_loss: 924.1943
Epoch 7/100
20/20 - 0s - 24ms/step - loss: 669.8951 - val_loss: 804.6687
Epoch 8/100
20/20 - 0s - 24ms/step - loss: 560.9973 - val_loss: 684.1733
Epoch 9/100
20/20 - 0s - 23ms/step - loss: 448.4435 - val_loss: 565.2325
Epoch 10/100
20/20 - 0s - 23ms/step - loss: 330.7674 - val_loss: 463.3444
Epoch 11/100
20/20 - 0s - 23ms/step - loss: 241.6576 - val_loss: 357.0130
Epoch 12/100
20/20 - 0s - 23ms/step - loss: 164.0199 - val_loss: 292.8903
Epoch 13/100
20/20 - 0s - 23ms/step - loss: 114.7777 - val_loss: 236.5329
Epoch 14/100
20/20 - 0s - 24ms/step - loss: 85.9027 - val_loss: 203.3577
Epoch 15/100
20/20 - 0s - 24ms/step - loss: 66.9982 - val_loss: 180.6386
Epoch 16/100
20/20 - 0s - 24ms/step - loss: 55.3312 - val_loss: 161.9439
Epoch 17/100
20/20 - 0s - 24ms/step - loss: 48.2769 - val_loss: 161.4217
Epoch 18/100
20/20 - 0s - 24ms/step - loss: 52.5615 - val_loss: 133.7694
Epoch 19/100
20/20 - 0s - 24ms/step - loss: 53.4043 - val_loss: 133.7848
Epoch 20/100
20/20 - 0s - 24ms/step - loss: 54.2255 - val_loss: 129.4212
Epoch 21/100
20/20 - 0s - 24ms/step - loss: 45.6314 - val_loss: 118.0108
Epoch 22/100
20/20 - 0s - 24ms/step - loss: 51.3588 - val_loss: 106.5076
Epoch 23/100
20/20 - 0s - 24ms/step - loss: 52.5312 - val_loss: 90.8066
Epoch 24/100
20/20 - 0s - 23ms/step - loss: 43.8305 - val_loss: 85.6798
Epoch 25/100
20/20 - 0s - 23ms/step - loss: 47.1212 - val_loss: 79.0089
Epoch 26/100
20/20 - 0s - 23ms/step - loss: 50.8475 - val_loss: 73.0575
Epoch 27/100
20/20 - 0s - 24ms/step - loss: 51.0468 - val_loss: 59.2353
Epoch 28/100
20/20 - 0s - 24ms/step - loss: 45.9967 - val_loss: 78.2602
Epoch 29/100
20/20 - 0s - 24ms/step - loss: 47.0901 - val_loss: 52.9780
Epoch 30/100
20/20 - 0s - 24ms/step - loss: 47.3766 - val_loss: 47.5640
Epoch 31/100
20/20 - 0s - 24ms/step - loss: 42.5370 - val_loss: 48.8302
Epoch 32/100
20/20 - 0s - 24ms/step - loss: 47.2238 - val_loss: 47.0310
Epoch 33/100
20/20 - 0s - 23ms/step - loss: 48.3662 - val_loss: 36.7654
Epoch 34/100
20/20 - 0s - 23ms/step - loss: 45.8937 - val_loss: 37.5395
Epoch 35/100
20/20 - 0s - 24ms/step - loss: 44.8140 - val_loss: 32.9044
Epoch 36/100
20/20 - 0s - 23ms/step - loss: 42.9819 - val_loss: 32.9282
Epoch 37/100
20/20 - 0s - 24ms/step - loss: 44.6450 - val_loss: 30.8153
Epoch 38/100
20/20 - 0s - 23ms/step - loss: 42.5527 - val_loss: 32.9134
Epoch 39/100
20/20 - 0s - 24ms/step - loss: 41.2212 - val_loss: 34.0955
Epoch 40/100
20/20 - 0s - 24ms/step - loss: 40.8858 - val_loss: 31.1663
Epoch 41/100
20/20 - 0s - 24ms/step - loss: 39.1441 - val_loss: 30.7913
Epoch 42/100
20/20 - 0s - 24ms/step - loss: 41.6637 - val_loss: 32.9845
Epoch 43/100
20/20 - 0s - 24ms/step - loss: 41.2684 - val_loss: 36.5402
Epoch 44/100
20/20 - 0s - 24ms/step - loss: 38.9893 - val_loss: 31.4564
Epoch 45/100
20/20 - 0s - 24ms/step - loss: 43.5343 - val_loss: 31.7061
Epoch 46/100
20/20 - 0s - 24ms/step - loss: 46.6553 - val_loss: 30.5272
Epoch 47/100
20/20 - 0s - 24ms/step - loss: 41.4756 - val_loss: 34.4247
Epoch 48/100
20/20 - 0s - 24ms/step - loss: 37.5195 - val_loss: 27.6704
Epoch 49/100
20/20 - 0s - 24ms/step - loss: 45.3640 - val_loss: 30.8570
Epoch 50/100
20/20 - 0s - 25ms/step - loss: 41.9307 - val_loss: 32.2340
Epoch 51/100
20/20 - 0s - 24ms/step - loss: 40.3019 - val_loss: 29.7646
Epoch 52/100
20/20 - 0s - 24ms/step - loss: 42.2092 - val_loss: 32.0892
Epoch 53/100
20/20 - 0s - 24ms/step - loss: 49.0837 - val_loss: 33.6533
Epoch 54/100
20/20 - 0s - 24ms/step - loss: 44.5548 - val_loss: 31.0715
Epoch 55/100
20/20 - 0s - 24ms/step - loss: 43.1499 - val_loss: 32.0000
Epoch 56/100
20/20 - 0s - 24ms/step - loss: 37.7953 - val_loss: 34.0036
Epoch 57/100
20/20 - 0s - 24ms/step - loss: 40.8584 - val_loss: 31.6355
Epoch 58/100
20/20 - 0s - 24ms/step - loss: 39.5833 - val_loss: 29.7016
Epoch 59/100
20/20 - 0s - 24ms/step - loss: 41.6789 - val_loss: 29.5030
Epoch 60/100
20/20 - 0s - 24ms/step - loss: 39.4460 - val_loss: 31.8928
Epoch 61/100
20/20 - 0s - 24ms/step - loss: 39.3664 - val_loss: 29.2591
Epoch 62/100
20/20 - 0s - 24ms/step - loss: 37.4858 - val_loss: 31.2040
Epoch 63/100
20/20 - 0s - 24ms/step - loss: 39.0642 - val_loss: 26.2379
Epoch 64/100
20/20 - 0s - 24ms/step - loss: 41.9629 - val_loss: 28.5628
Epoch 65/100
20/20 - 0s - 24ms/step - loss: 37.0546 - val_loss: 28.1488
Epoch 66/100
20/20 - 0s - 24ms/step - loss: 37.7984 - val_loss: 28.8796
Epoch 67/100
20/20 - 0s - 23ms/step - loss: 35.6583 - val_loss: 28.0430
Epoch 68/100
20/20 - 0s - 24ms/step - loss: 39.5696 - val_loss: 30.2443
Epoch 69/100
20/20 - 0s - 23ms/step - loss: 37.8556 - val_loss: 28.8566
Epoch 70/100
20/20 - 0s - 24ms/step - loss: 35.5118 - val_loss: 27.2597
Epoch 71/100
20/20 - 0s - 24ms/step - loss: 39.5423 - val_loss: 33.7363
Epoch 72/100
20/20 - 0s - 24ms/step - loss: 42.3116 - val_loss: 30.8575
Epoch 73/100
20/20 - 0s - 24ms/step - loss: 38.5569 - val_loss: 26.5051
Epoch 74/100
20/20 - 0s - 24ms/step - loss: 37.5982 - val_loss: 28.6000
Epoch 75/100
20/20 - 1s - 35ms/step - loss: 37.5171 - val_loss: 27.1207
Epoch 76/100
20/20 - 1s - 29ms/step - loss: 37.8798 - val_loss: 24.9966
Epoch 77/100
20/20 - 1s - 26ms/step - loss: 37.8039 - val_loss: 28.1145
Epoch 78/100
20/20 - 0s - 24ms/step - loss: 31.7247 - val_loss: 29.7848
Epoch 79/100
20/20 - 1s - 28ms/step - loss: 37.8787 - val_loss: 26.6716
Epoch 80/100
20/20 - 0s - 23ms/step - loss: 38.2513 - val_loss: 33.6675
Epoch 81/100
20/20 - 0s - 25ms/step - loss: 34.0801 - val_loss: 29.0437
Epoch 82/100
20/20 - 0s - 24ms/step - loss: 38.3235 - val_loss: 30.2644
Epoch 83/100
20/20 - 0s - 25ms/step - loss: 43.7642 - val_loss: 25.6373
Epoch 84/100
20/20 - 0s - 24ms/step - loss: 39.7551 - val_loss: 28.7980
Epoch 85/100
20/20 - 0s - 24ms/step - loss: 38.1819 - val_loss: 25.3355
Epoch 86/100
20/20 - 0s - 24ms/step - loss: 43.3701 - val_loss: 26.6615
Epoch 87/100
20/20 - 0s - 24ms/step - loss: 37.2329 - val_loss: 27.0618
Epoch 88/100
20/20 - 1s - 25ms/step - loss: 32.2176 - val_loss: 27.7741
Epoch 89/100
20/20 - 0s - 24ms/step - loss: 39.3697 - val_loss: 27.0562
Epoch 90/100
20/20 - 0s - 24ms/step - loss: 39.2577 - val_loss: 28.8201
Epoch 91/100
20/20 - 1s - 25ms/step - loss: 35.9759 - val_loss: 28.6002
Epoch 92/100
20/20 - 0s - 24ms/step - loss: 35.4397 - val_loss: 26.1399
Epoch 93/100
20/20 - 0s - 24ms/step - loss: 45.4611 - val_loss: 24.7634
Epoch 94/100
20/20 - 0s - 24ms/step - loss: 39.9493 - val_loss: 25.1228
Epoch 95/100
20/20 - 0s - 24ms/step - loss: 39.3192 - val_loss: 24.0408
Epoch 96/100
20/20 - 0s - 24ms/step - loss: 36.8565 - val_loss: 25.0370
Epoch 97/100
20/20 - 1s - 25ms/step - loss: 33.8901 - val_loss: 26.8920
Epoch 98/100
20/20 - 0s - 24ms/step - loss: 37.6516 - val_loss: 24.4222
Epoch 99/100
20/20 - 0s - 24ms/step - loss: 35.7329 - val_loss: 23.8047
Epoch 100/100
20/20 - 0s - 24ms/step - loss: 31.8435 - val_loss: 24.0617
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
Best Model Mean Squared Error: 23.856375878320204
Best Model R-squared: 0.8988315587943679
[Figure: Training and Validation Loss]
[Figure: Actual vs. Predicted Values (Neural Network)]

Neural Network Model - Hyperparameter Tuning and Results¶

In this experiment, I utilized Bayesian Optimization to identify the best neural network model for predicting my target variable. The following process was carried out:

Model Search Process:¶

  • The neural network model was dynamically built using the keras_tuner library with Bayesian Optimization.
  • I searched over several tunable hyperparameters, including the number of neurons in each layer, the learning rate, L2 regularization, the number of hidden layers, and the dropout rate.
  • The best model was found after conducting 20 trials of hyperparameter optimization using validation loss as the primary metric.

Best Hyperparameters:¶

The optimal configuration found by the tuner is as follows:

  • Units in the first layer: 256
  • L2 regularization in the first layer: 0.0004417575248265391
  • Dropout rate in the first layer: 0.30
  • Number of hidden layers: 1
  • Units in the second layer: 192
  • L2 regularization in the second layer: 0.0005914405187452458
  • Dropout rate in the second layer: 0.20
  • Learning rate: 0.0013279102669134521

Performance Metrics:¶

After training the model on the training set and evaluating it on the test set, the following performance metrics were observed:

  • Mean Squared Error (MSE): 23.856
  • R-squared: 0.899

Training and Validation Loss Plot:¶

The figure below illustrates the Training Loss and Validation Loss over the course of 100 epochs. Both curves stabilize and converge, indicating no overfitting or significant divergence between training and validation performance.

Actual vs. Predicted Plot:¶

The scatter plot below compares the actual values of the target variable against the predicted values generated by the best neural network model. The points align well with the ideal line (red dashed line), which signifies that the model's predictions closely match the actual values.

Conclusion:¶

The optimized neural network model demonstrated strong predictive performance with an R-squared value of approximately 0.90, indicating that it explains about 90% of the variance in the target variable. The combination of L2 regularization and dropout layers helped keep the model robust and avoid overfitting.

Model Ranking for Concrete Compressive Strength Prediction¶

Introduction¶

This report presents a comparison of various machine learning models applied to predict the compressive strength of concrete. The models are evaluated based on two primary metrics:

  • Mean Squared Error (MSE)
  • R-squared value (R²)

Each model was trained using a specific methodology, with some employing hyperparameter tuning for enhanced performance. Below are the results of each model in terms of their performance metrics.

Models and Results¶

1. Linear Regression¶

Linear Regression is a simple model that predicts the target variable as a linear combination of the input features.

  • Mean Squared Error (MSE): MSE_value_lr
  • R-squared (R²): R2_value_lr

2. Random Forest Regressor¶

The Random Forest model was applied using ensemble learning, where multiple decision trees contribute to the final prediction.

  • Mean Squared Error (MSE): MSE_value_rf
  • R-squared (R²): R2_value_rf

3. Gradient Boosting Regressor¶

Gradient Boosting is an ensemble technique that builds decision trees sequentially, where each tree corrects the errors made by its predecessor. Hyperparameter tuning was performed using Bayesian Optimization for this model.

  • Best Hyperparameters:

    • n_estimators: 500
    • learning_rate: 0.2057
    • max_depth: 10
    • subsample: 0.5
    • min_samples_split: 0.24
  • Mean Squared Error (MSE): MSE_value_gb

  • R-squared (R²): R2_value_gb
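Those tuned values map directly onto scikit-learn's `GradientBoostingRegressor`; a sketch on synthetic `make_regression` data (illustrative only, not the concrete dataset, so the score will differ from the report's):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the concrete features
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Hyperparameters from the Bayesian search above
gb = GradientBoostingRegressor(
    n_estimators=500,
    learning_rate=0.2057,
    max_depth=10,
    subsample=0.5,
    min_samples_split=0.24,  # a float here means a fraction of the samples
    random_state=0,
)
gb.fit(X_tr, y_tr)
print(f"Test R^2: {gb.score(X_te, y_te):.3f}")
```

Note that `subsample=0.5` makes this stochastic gradient boosting: each tree sees only half of the training rows, which acts as a regularizer.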


4. AdaBoost Regressor¶

AdaBoost is another ensemble technique where weak learners (decision trees in this case) are sequentially combined to improve performance. Bayesian Optimization was used to tune its hyperparameters.

  • Best Hyperparameters:

    • n_estimators: 500
    • learning_rate: 0.0791
    • max_depth: 10 (from base decision tree)
  • Mean Squared Error (MSE): MSE_value_ada

  • R-squared (R²): R2_value_ada


5. Neural Network¶

A fully connected deep neural network was optimized using Bayesian Optimization to select the best hyperparameters. The network includes multiple layers, with L2 regularization and dropout layers to prevent overfitting.

  • Best Hyperparameters:

    • Units in first layer: 256
    • L2 regularization (first layer): 0.00044
    • Dropout rate (first layer): 0.3
    • Number of hidden layers: 1
    • Units in second layer: 192
    • L2 regularization (second layer): 0.00059
    • Dropout rate (second layer): 0.2
    • Learning rate: 0.00132
  • Mean Squared Error (MSE): MSE_value_nn

  • R-squared (R²): R2_value_nn


Conclusion¶

The following table summarizes the performance of each model:

Model Mean Squared Error (MSE) R-squared (R²)
Linear Regression MSE_value_lr R2_value_lr
Random Forest Regressor MSE_value_rf R2_value_rf
Gradient Boosting MSE_value_gb R2_value_gb
AdaBoost Regressor MSE_value_ada R2_value_ada
Neural Network MSE_value_nn R2_value_nn

From the table, the tuned ensemble models and the Neural Network outperformed the simpler baselines, achieving the highest R² scores and lowest MSEs. The Neural Network, in particular, showed a strong ability to predict concrete compressive strength with an R² of approximately 0.90, explaining a large proportion of the variance in the target variable.


Future Work¶

Further improvements can be explored by:

  • Adding more sophisticated feature engineering.
  • Investigating other deep learning architectures such as Convolutional Neural Networks (CNN) or Recurrent Neural Networks (RNN).
  • Exploring ensemble approaches that combine the predictions of multiple models for even better results.
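The last bullet can be prototyped with scikit-learn's `VotingRegressor`, which averages the member models' predictions; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the concrete features
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=2)

# VotingRegressor fits each member and averages their predictions
ensemble = VotingRegressor([
    ('lr', LinearRegression()),
    ('rf', RandomForestRegressor(n_estimators=100, random_state=2)),
])
ensemble.fit(X_tr, y_tr)
print(f"Ensemble test R^2: {ensemble.score(X_te, y_te):.3f}")
```

Weighted averages (via the `weights` argument) or stacking with a meta-learner are natural next steps.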

Classification Task: Converting Concrete Strength Labels into Categorical Ranges¶

In [121]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assuming y is your target variable (as a Pandas Series or Numpy array)
plt.figure(figsize=(8, 6))

# Seaborn histogram with Kernel Density Estimation (KDE) overlay
sns.histplot(y, kde=True, color='blue')

# Adding titles and labels
plt.title('Concrete Compressive Strength (MPa)', fontsize=16)
plt.xlabel('Compressive Strength (MPa)', fontsize=12)
plt.ylabel('Count', fontsize=12)

# Display the plot
plt.grid(True)
plt.show()
[Figure: distribution of concrete compressive strength with KDE overlay]
In [122]:
print(df2.columns)
Index(['Blast Furnace Slag', 'Fly Ash', 'Superplasticizer', 'Coarse Aggregate',
       'Fine Aggregate', 'Age', 'Water_Cement_Ratio', 'Coarse_Fine_Ratio'],
      dtype='object')
In [123]:
import numpy as np
import pandas as pd

# Assuming y is your target variable for classification
# Define the thresholds based on ACI standards
very_high_strength_threshold = 60
high_strength_threshold = 41
normal_strength_threshold = 30
weak_threshold = 20

# Create strength category column based on thresholds
strength_category = np.select(
    [y >= very_high_strength_threshold,
     (y < very_high_strength_threshold) & (y >= high_strength_threshold),
     (y < high_strength_threshold) & (y >= normal_strength_threshold),
     (y < normal_strength_threshold) & (y >= weak_threshold),
     y < weak_threshold],
    ['Very High Strength', 'High Strength', 'Normal Strength', 'Weak', 'Very Weak'],
    default='Undefined'
)
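An equivalent, more declarative way to bin a continuous target is `pd.cut` with the same ACI-style thresholds; a sketch on illustrative values (the notebook's `y` would replace `y_demo`):

```python
import numpy as np
import pandas as pd

# Illustrative strength values in MPa, one per category (toy data)
y_demo = pd.Series([15.0, 25.0, 35.0, 50.0, 70.0])

# right=False gives half-open bins [a, b), matching the np.select conditions above
bins = [-np.inf, 20, 30, 41, 60, np.inf]
labels = ['Very Weak', 'Weak', 'Normal Strength', 'High Strength', 'Very High Strength']
categories = pd.cut(y_demo, bins=bins, labels=labels, right=False)
print(categories.tolist())
# → ['Very Weak', 'Weak', 'Normal Strength', 'High Strength', 'Very High Strength']
```

`pd.cut` also returns an ordered Categorical, which is convenient for plotting the classes in strength order.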

Exploring Categorical Data: Counting Unique Categories in the 'strength_category' Column¶

In [124]:
# Count the number of observations in each category
category_counts = pd.Series(strength_category).value_counts()
print(category_counts)
Normal Strength       202
Very Weak             194
Weak                  181
High Strength         152
Very High Strength     47
Name: count, dtype: int64

Previewing the First Few Rows of DataFrame df2¶

In [125]:
df2['Strength'] = y
df2['Strength_Category'] = strength_category
df2.head()
Out[125]:
Blast Furnace Slag Fly Ash Superplasticizer Coarse Aggregate Fine Aggregate Age Water_Cement_Ratio Coarse_Fine_Ratio Strength Strength_Category
1 0.0 0.0 2.5 1055.0 676.0 28 0.300000 1.560651 61.89 Very High Strength
8 114.0 0.0 0.0 932.0 670.0 28 0.857143 1.391045 45.85 High Strength
11 132.4 0.0 0.0 978.4 825.5 28 0.966767 1.185221 28.02 Weak
14 76.0 0.0 0.0 932.0 670.0 28 0.750000 1.391045 47.81 High Strength
21 209.4 0.0 0.0 1047.0 806.9 28 1.375358 1.297559 28.24 Weak

Splitting the Data into Features and Labels for Classification¶

In [126]:
# Split the data into features and labels for classification
y_cf = df2['Strength_Category']  # Use the correct column name for the strength category
X_cf = df2.drop(columns=['Strength_Category', 'Strength', 'Fine Aggregate', 'Coarse Aggregate'])  # Drop both the target category and the original strength values
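Because the rarest category ('Very High Strength') has only 47 observations, a plain random split can under-represent it in the test set. Passing `stratify=` to `train_test_split` preserves each class's share in both halves; a sketch on synthetic labels (the class counts here are illustrative, not the dataset's):

```python
from collections import Counter
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels standing in for y_cf (hypothetical counts)
rng = np.random.default_rng(42)
labels = np.array(['Normal'] * 200 + ['Weak'] * 180 + ['Very High'] * 40)
X_demo = rng.normal(size=(len(labels), 3))

# stratify=labels keeps each class's proportion the same in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, labels, test_size=0.2, random_state=42, stratify=labels
)
print(Counter(y_te))  # roughly 40 / 36 / 8 of the three classes
```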

Unique Class Labels in the Strength Category¶

In [127]:
# Get unique class labels from the target variable for classification
class_labels = y_cf.unique()
In [128]:
print("Class labels:", class_labels)
Class labels: ['Very High Strength' 'High Strength' 'Weak' 'Very Weak' 'Normal Strength']

Importing Necessary Libraries and Modules for Classification¶

In [129]:
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, roc_curve, auc, precision_recall_curve
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

Classification Model Training and Evaluation Pipeline¶

In [130]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, balanced_accuracy_score
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

def train_and_evaluate_model(algorithm, X, y, scaler=MinMaxScaler(), **classifier_kwargs):
    """
    Train and evaluate a classification model using a pipeline.

    Parameters:
    - algorithm: The classifier algorithm (e.g., RandomForestClassifier, LogisticRegression, etc.)
    - X: Features
    - y: Target variable
    - scaler: Scaler for preprocessing (default is MinMaxScaler())
    - **classifier_kwargs: Additional keyword arguments for the classifier

    Returns:
    - y_pred: Predicted values
    - metrics: Dictionary containing various accuracy measures and a classification report
    - conf_matrix: Confusion matrix
    """

    # Split the dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Create a pipeline with scaling and the classifier
    pipeline = Pipeline([
        ('scaler', scaler),
        ('classifier', algorithm(**classifier_kwargs))
    ])

    # Fit the pipeline on the training data
    pipeline.fit(X_train, y_train)

    # Make predictions on the test set
    y_pred = pipeline.predict(X_test)

    # Calculate standard and balanced accuracies
    standard_accuracy = accuracy_score(y_test, y_pred)
    balanced_acc = balanced_accuracy_score(y_test, y_pred)

    # Weighted accuracy: average the per-class accuracies (confusion-matrix
    # diagonal over row sums) using 'balanced' class weights, aligning the
    # weight vector (ordered by the training classes) with the classes in y_test
    conf_matrix = confusion_matrix(y_test, y_pred)
    class_accuracies = conf_matrix.diagonal() / conf_matrix.sum(axis=1)
    class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
    weighted_accuracy = np.average(class_accuracies, weights=class_weights[np.unique(y_train).searchsorted(np.unique(y_test))])

    # Compute classification report
    classification_rep = classification_report(y_test, y_pred, output_dict=True)

    # Store all metrics in a dictionary
    metrics = {
        'Standard Accuracy': standard_accuracy,
        'Balanced Accuracy': balanced_acc,
        'Weighted Accuracy': weighted_accuracy,
        'Classification Report': classification_rep
    }

    return y_pred, metrics, conf_matrix
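The difference between the standard and balanced accuracies returned above shows up clearly on a toy imbalanced example: a degenerate classifier that always predicts the majority class scores well on standard accuracy but is exposed by balanced accuracy, which is the unweighted mean of per-class recalls.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy imbalanced case: 90 majority samples, 10 minority samples.
# The "classifier" below always predicts the majority class.
y_true = np.array(['maj'] * 90 + ['min'] * 10)
y_pred = np.array(['maj'] * 100)

acc = accuracy_score(y_true, y_pred)            # 0.9: looks strong
bal = balanced_accuracy_score(y_true, y_pred)   # 0.5: mean of recalls 1.0 and 0.0
print(acc, bal)
```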

Classification Metrics Visualization¶

In [131]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import pandas as pd  # Ensure pandas is imported
import numpy as np

def plot_classification_metrics(y_pred, accuracy, conf_matrix, classification_rep, class_labels):
    """
    Plot classification metrics including accuracy, confusion matrix, and classification report.

    Parameters:
    - y_pred: Predicted labels from the classifier
    - accuracy: Accuracy score from the classifier
    - conf_matrix: Confusion matrix computed with actual and predicted labels
    - classification_rep: Classification report containing precision, recall, and f1-score
    - class_labels: List of class labels in the order they should appear in the plots

    Returns:
    - None (displays and saves plots)
    """

    # Ensure class_labels has the correct order: from 'Very Weak' to 'Very High Strength'
    class_labels_ordered = ['Very Weak', 'Weak', 'Normal Strength', 'High Strength', 'Very High Strength']
    
    # Reorder confusion matrix to match class labels.
    # sklearn's confusion_matrix orders rows/columns alphabetically (by sorted
    # label name), so label the raw matrix with sorted(class_labels) first.
    conf_matrix_ordered = pd.DataFrame(conf_matrix, index=sorted(class_labels), columns=sorted(class_labels))\
        .reindex(index=class_labels_ordered, columns=class_labels_ordered).values
    
    # Set up the figure with two subplots: one for the confusion matrix and one for the metrics
    fig, axes = plt.subplots(1, 2, figsize=(14, 6), dpi=1200)  # Set DPI to 1200

    # Define common font properties
    title_font = {'fontsize': 16, 'fontweight': 'bold'}
    label_font = {'fontsize': 14, 'fontweight': 'bold'}
    tick_font = {'fontsize': 12, 'fontweight': 'bold'}
    annot_font = {'fontsize': 14, 'fontweight': 'bold'}  # Font properties for annotations

    # Plot Confusion Matrix
    sns.heatmap(
        conf_matrix_ordered, 
        annot=True, 
        fmt='d', 
        cmap='Blues', 
        cbar=False,
        xticklabels=class_labels_ordered, 
        yticklabels=class_labels_ordered, 
        ax=axes[0],
        annot_kws=annot_font  # Apply font properties to annotations
    )
    axes[0].set_xlabel('Predicted', **label_font)
    axes[0].set_ylabel('Actual', **label_font)
    axes[0].set_title('Confusion Matrix', **title_font)
    
    # Customize tick labels for confusion matrix
    # Rotate x-axis labels by 45 degrees
    for label in axes[0].get_xticklabels():
        label.set_fontsize(tick_font['fontsize'])
        label.set_fontweight(tick_font['fontweight'])
        label.set_rotation(45)  # Rotate x-axis labels by 45 degrees
    for label in axes[0].get_yticklabels():
        label.set_fontsize(tick_font['fontsize'])
        label.set_fontweight(tick_font['fontweight'])
        # Typically, y-axis labels are not rotated; adjust if needed

    # Plot Classification Report Metrics for each class
    metrics = ['precision', 'recall', 'f1-score']
    scores = {metric: [classification_rep[label][metric] for label in class_labels_ordered] for metric in metrics}
    df_scores = pd.DataFrame(scores, index=class_labels_ordered)

    # Create a bar plot for metrics (precision, recall, f1-score) per class
    df_scores.plot(kind='bar', ax=axes[1], edgecolor='black')
    axes[1].set_title('Classification Metrics per Class', **title_font)
    axes[1].set_ylabel('Score', **label_font)
    axes[1].set_ylim([0, 1])  # Ensure the y-axis goes from 0 to 1
    axes[1].set_xticklabels(class_labels_ordered, rotation=45, ha='right')
    
    # Customize tick labels for classification metrics
    for label in axes[1].get_xticklabels():
        label.set_fontsize(tick_font['fontsize'])
        label.set_fontweight(tick_font['fontweight'])
    for label in axes[1].get_yticklabels():
        label.set_fontsize(tick_font['fontsize'])
        label.set_fontweight(tick_font['fontweight'])

    axes[1].grid(True, axis='y', linestyle='--', alpha=0.7)

    # Adjust layout for clarity
    plt.tight_layout()

    # Save the figure with 1200 dpi
    plt.savefig('classification_metrics.png', dpi=1200, bbox_inches='tight')
    
    # Display the plot
    plt.show()

Random Forest Classifier Evaluation¶

In [132]:
from skopt import BayesSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from skopt.space import Integer, Categorical, Real
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix, classification_report
from sklearn.utils.class_weight import compute_class_weight

# Split the data to avoid data leakage
X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(
    X_cf, y_cf, test_size=0.2, random_state=42
)

# Define the parameter space for Bayesian Optimization
search_spaces_rf = {
    'n_estimators': Integer(100, 500),
    'max_depth': Integer(5, 50),
    'min_samples_split': Integer(2, 20),
    'max_features': Categorical(['sqrt', 'log2', None])
}

# Initialize the classifier and BayesSearchCV
clf_rf = RandomForestClassifier(random_state=42)
bayes_search_rf = BayesSearchCV(
    clf_rf,
    search_spaces_rf,
    n_iter=50,
    scoring='balanced_accuracy',  # Optimizing for balanced accuracy
    cv=5,
    n_jobs=-1,
    refit=True,
    random_state=42
)

# Scale the features
scaler_rf = MinMaxScaler()
X_train_scaled_rf = scaler_rf.fit_transform(X_train_rf)
X_test_scaled_rf = scaler_rf.transform(X_test_rf)

# Perform the search
bayes_search_rf.fit(X_train_scaled_rf, y_train_rf)

# Best parameters and scores
best_params_rf = bayes_search_rf.best_params_
best_score_rf = bayes_search_rf.best_score_

# Fit the model with the best parameters found
best_rf_rf = RandomForestClassifier(**best_params_rf, random_state=42)
best_rf_rf.fit(X_train_scaled_rf, y_train_rf)

# Evaluate on the test set
y_pred_rf = best_rf_rf.predict(X_test_scaled_rf)
standard_accuracy_rf = accuracy_score(y_test_rf, y_pred_rf)
balanced_accuracy_rf = balanced_accuracy_score(y_test_rf, y_pred_rf)

# Calculate weighted accuracy using class weights
class_weights_rf = compute_class_weight(
    'balanced', classes=np.unique(y_train_rf), y=y_train_rf
)
weighted_accuracy_rf = np.average(
    [accuracy_score(y_test_rf == cls, y_pred_rf == cls) for cls in np.unique(y_test_rf)],
    weights=class_weights_rf[np.unique(y_train_rf).searchsorted(np.unique(y_test_rf))]
)

# Print accuracies
print("Random Forest - Best Parameters:", best_params_rf)
print("Random Forest - Best Cross-Validation Accuracy: {:.4f}".format(best_score_rf))
print("Random Forest - Test Standard Accuracy: {:.4f}".format(standard_accuracy_rf))
print("Random Forest - Test Balanced Accuracy: {:.4f}".format(balanced_accuracy_rf))
print("Random Forest - Test Weighted Accuracy: {:.4f}".format(weighted_accuracy_rf))
Random Forest - Best Parameters: OrderedDict([('max_depth', 14), ('max_features', None), ('min_samples_split', 3), ('n_estimators', 100)])
Random Forest - Best Cross-Validation Accuracy: 0.7057
Random Forest - Test Standard Accuracy: 0.7308
Random Forest - Test Balanced Accuracy: 0.7395
Random Forest - Test Weighted Accuracy: 0.9212

Model Insights¶

Best Parameters:¶

The optimal hyperparameters selected through cross-validation are:

  • max_depth: 14
  • max_features: None
  • min_samples_split: 3
  • n_estimators: 100

Performance Metrics:¶

  • Best Cross-Validation (Balanced) Accuracy: 70.57%

    • The search optimized balanced accuracy, and the average score across the cross-validation folds was 70.57%, suggesting that the model generalizes reasonably well across different subsets of the training data.
  • Test Standard Accuracy: 73.08%

    • On unseen test data, the model achieved a standard accuracy of 73.08%, indicating reasonable generalization beyond the training dataset.
  • Test Balanced Accuracy: 73.95%

    • The balanced accuracy, which accounts for class imbalance by averaging the recall obtained on each class, was 73.95%, so performance is not driven purely by the more frequent classes.
  • Test Weighted Accuracy: 92.12%

    • The custom weighted accuracy, which averages per-class accuracies using 'balanced' class weights (rarer classes receive larger weights), was 92.12%. The high value indicates that the minority classes are classified accurately.

Conclusion:¶

The RandomForestClassifier, with the parameters above, performs consistently across standard, balanced, and weighted accuracy metrics, making it well-suited for scenarios where class imbalance is a concern. The high weighted accuracy in particular points to the model's strength on minority classes, since the 'balanced' weighting gives those classes a proportionally larger influence on the average.
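The weighted accuracy reported here can be reproduced on a small example: per-class accuracies come from the confusion-matrix diagonal divided by row sums, then are averaged with 'balanced' class weights, which are inversely proportional to class frequency. The toy labels below are illustrative only:

```python
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: class 'a' is common, class 'b' is rare
y_train = np.array(['a'] * 80 + ['b'] * 20)
y_test  = np.array(['a'] * 8 + ['b'] * 2)
y_pred  = np.array(['a'] * 8 + ['b', 'a'])  # one rare sample misclassified

cm = confusion_matrix(y_test, y_pred)            # rows = true labels, sorted order
per_class_acc = cm.diagonal() / cm.sum(axis=1)   # [1.0, 0.5]: recall of each class

# 'balanced' weights = n_samples / (n_classes * class_count):
# the rare class 'b' gets weight 2.5 vs 0.625 for 'a'
weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)

weighted_acc = np.average(per_class_acc, weights=weights)
print(weighted_acc)  # (1.0*0.625 + 0.5*2.5) / 3.125 = 0.6
```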

Visualizing Feature Importances in a Random Forest Classification Model¶

In [133]:
import matplotlib.pyplot as plt
import numpy as np

# Retrieve feature importances from the best model obtained from Bayesian optimization
importances = best_rf_rf.feature_importances_

# Get the indices of the features sorted by importance
indices = np.argsort(importances)[::-1]

# Names of features sorted by importance
sorted_feature_names = [X_cf.columns[i] for i in indices]

# Create a horizontal bar plot to visualize the feature importances
plt.figure(figsize=(12, 6))
plt.title('Feature Importances in RandomForest Classifier')

# Change plt.bar to plt.barh for horizontal bars
plt.barh(range(len(indices)), importances[indices], color='b', align='center')

# Adjust the ticks to align with the y-axis for horizontal bars
plt.yticks(range(len(indices)), sorted_feature_names, rotation=0)

plt.xlabel('Importance')   # X-axis now represents Importance
plt.ylabel('Feature')     # Y-axis now represents Feature
plt.show()
[Figure: horizontal bar chart of feature importances from the Random Forest classifier]
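Impurity-based importances like the ones plotted above can be biased toward features with many distinct values. Permutation importance, which measures the accuracy drop when a feature is shuffled on held-out data, is a useful cross-check. A sketch on synthetic data (two features, only the first informative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: feature 0 determines the label, feature 1 is pure noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = (X[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the score drop
result = permutation_importance(clf, X_te, y_te, n_repeats=10, random_state=0)
print(result.importances_mean)  # feature 0 should dominate
```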

Classification Metrics Visualization¶

In [134]:
# Confusion matrix and classification report for Random Forest
conf_matrix_rf = confusion_matrix(y_test_rf, y_pred_rf)
classification_rep_rf = classification_report(y_test_rf, y_pred_rf, output_dict=True)

# Extracting the accuracy score from the Random Forest metrics
accuracy_rf = standard_accuracy_rf

# Calling the plot_classification_metrics function with Random Forest outputs
plot_classification_metrics(
    y_pred=y_pred_rf,
    accuracy=accuracy_rf,
    conf_matrix=conf_matrix_rf,
    classification_rep=classification_rep_rf,
    class_labels=class_labels  # Ensure this matches your ordered class labels
)
[Figure: confusion matrix heatmap and per-class precision/recall/F1 bar chart for the Random Forest classifier]

Model Performance Analysis¶

Confusion Matrix Insights:¶

  • Very Weak: Most samples correctly classified with few misclassifications into the 'Weak' and 'Normal Strength' categories.
  • Weak: Good performance but some samples confused with 'Very Weak' and 'Normal Strength'.
  • Normal Strength: A noticeable number of samples misclassified as 'High Strength'.
  • High Strength: Excellent performance with most samples correctly classified.
  • Very High Strength: Significant confusion with 'High Strength', indicating possible difficulties in distinguishing between these two classes.

Classification Metrics per Class:¶

  • Precision:
    • Precision is high across all categories, indicating a low number of false positives. This suggests that when a class label is predicted, it is likely to be correct.
  • Recall:
    • The recall varies significantly across classes. 'High Strength' has excellent recall, indicating most of these instances are captured. 'Normal Strength' and 'Very High Strength' have lower recall, suggesting that many instances are missed or misclassified.
  • F1-Score:
    • The F1-score, which balances precision and recall, is generally high, but the variation indicates room for improvement in certain classes, especially 'Normal Strength' and 'Very High Strength'.

Conclusion:¶

The model shows strong precision across all classes but struggles with recall in 'Normal Strength' and 'Very High Strength' categories. The high F1-scores for 'Very Weak', 'Weak', and 'High Strength' suggest robust performance in these classes. Improvement in distinguishing between 'Very High Strength' and 'High Strength' could enhance overall model accuracy.

Logistic Regression Classification Results¶

In [135]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.metrics import (
    accuracy_score, balanced_accuracy_score,
    confusion_matrix, classification_report,
    precision_score, recall_score, f1_score
)
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Define the calculate_weighted_metrics function as above
def calculate_weighted_metrics(y_true, y_pred):
    weighted_precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    weighted_recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    weighted_f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    
    metrics = {
        'Weighted Precision': weighted_precision,
        'Weighted Recall': weighted_recall,
        'Weighted F1-Score': weighted_f1
    }
    
    return metrics

# Define the model pipeline including preprocessing
model_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),
    ('logreg', LogisticRegression(random_state=42, max_iter=1000))
])

# Define the search space for hyperparameters, including l1_ratio for elasticnet
search_spaces = {
    'logreg__C': Real(1e-6, 1e+6, prior='log-uniform'),
    'logreg__penalty': Categorical(['l2', 'l1', 'elasticnet']),
    'logreg__solver': Categorical(['saga']),  # 'saga' supports 'elasticnet'
    'logreg__l1_ratio': Real(0, 1, prior='uniform'),  # l1_ratio for elasticnet
    'logreg__max_iter': Integer(100, 1000)
}

# Setting up Bayesian optimization with cross-validation
opt = BayesSearchCV(
    model_pipeline, search_spaces, n_iter=30, cv=3, n_jobs=-1, scoring='accuracy', random_state=42
)

# Assuming X_cf and y_cf are your features and labels
X_train, X_test, y_train, y_test = train_test_split(
    X_cf, y_cf, test_size=0.2, random_state=42
)

# Fit the model
opt.fit(X_train, y_train)

# Extracting the best estimator and using it to predict on test data
best_model = opt.best_estimator_
y_pred_lr = best_model.predict(X_test)

# Calculate metrics
metrics_lr = {
    'Standard Accuracy': accuracy_score(y_test, y_pred_lr),
    'Balanced Accuracy': balanced_accuracy_score(y_test, y_pred_lr),
    'Weighted Accuracy': np.average(
        confusion_matrix(y_test, y_pred_lr).diagonal() / confusion_matrix(y_test, y_pred_lr).sum(axis=1),
        weights=compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)[np.unique(y_train).searchsorted(np.unique(y_test))]
    )
}

# Calculate Weighted Precision, Recall, and F1-Score
metrics_weighted = calculate_weighted_metrics(y_test, y_pred_lr)
metrics_lr.update(metrics_weighted)  # Add weighted metrics to the metrics_lr dictionary

# Displaying results
print(f"Best parameters: {opt.best_params_}")
print(f"Standard Accuracy: {metrics_lr['Standard Accuracy']:.4f}")
print(f"Balanced Accuracy: {metrics_lr['Balanced Accuracy']:.4f}")
print(f"Weighted Accuracy: {metrics_lr['Weighted Accuracy']:.4f}")
print(f"Weighted Precision: {metrics_lr['Weighted Precision']:.4f}")
print(f"Weighted Recall: {metrics_lr['Weighted Recall']:.4f}")
print(f"Weighted F1-Score: {metrics_lr['Weighted F1-Score']:.4f}")
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred_lr))
print("\nClassification Report:\n", classification_report(y_test, y_pred_lr))
(Output truncated: the search repeatedly emitted UserWarning: l1_ratio parameter is only used when penalty is 'elasticnet' for candidates using the 'l1' or 'l2' penalty, along with occasional ConvergenceWarning: max_iter was reached for some 'saga' fits.)
  warnings.warn(
/Users/soheil/anaconda3/envs/tf_m1/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:1197: UserWarning: l1_ratio parameter is only used when penalty is 'elasticnet'. Got (penalty=l2)
  warnings.warn(
Best parameters: OrderedDict([('logreg__C', 11185.625288472094), ('logreg__l1_ratio', 0.8833152773808622), ('logreg__max_iter', 373), ('logreg__penalty', 'elasticnet'), ('logreg__solver', 'saga')])
Standard Accuracy: 0.6410
Balanced Accuracy: 0.7038
Weighted Accuracy: 0.7998
Weighted Precision: 0.6478
Weighted Recall: 0.6410
Weighted F1-Score: 0.6356

Confusion Matrix:
 [[18 12  3  2  1]
 [ 7 23  0  3  6]
 [ 0  0  8  0  0]
 [ 0  0  0 27  5]
 [ 0  9  0  8 24]]

Classification Report:
                     precision    recall  f1-score   support

     High Strength       0.72      0.50      0.59        36
   Normal Strength       0.52      0.59      0.55        39
Very High Strength       0.73      1.00      0.84         8
         Very Weak       0.68      0.84      0.75        32
              Weak       0.67      0.59      0.62        41

          accuracy                           0.64       156
         macro avg       0.66      0.70      0.67       156
      weighted avg       0.65      0.64      0.64       156

Logistic Regression Model Insights¶

Best Hyperparameters:¶

  • C: 11185.63 (inverse regularization strength)
  • l1_ratio: 0.883 (L1/L2 regularization mix for elasticnet)
  • max_iter: 373 (maximum number of iterations)
  • penalty: elasticnet (combined L1/L2 regularization)
  • solver: saga (solver capable of handling L1 and elasticnet penalties)

Performance Metrics:¶

  • Standard Accuracy: 64.10%
  • Balanced Accuracy: 70.38%
    • Higher than standard accuracy, indicating that the model recalls the smaller classes (e.g., Very High Strength) relatively well compared to the majority classes.
  • Weighted Accuracy: 79.98%
    • Weighting per-class accuracies by balanced class weights further rewards the strong recall on the minority classes.
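Balanced accuracy is simply the unweighted mean of the per-class recalls, which is why it can exceed standard accuracy when minority classes are recalled well. A quick check against the confusion matrix printed in the output above (rows follow sklearn's sorted label order):

```python
import numpy as np

# Confusion matrix from the logistic regression output above
# (rows/columns in sklearn's sorted label order)
cm = np.array([[18, 12,  3,  2,  1],
               [ 7, 23,  0,  3,  6],
               [ 0,  0,  8,  0,  0],
               [ 0,  0,  0, 27,  5],
               [ 0,  9,  0,  8, 24]])

per_class_recall = cm.diagonal() / cm.sum(axis=1)
balanced_acc = per_class_recall.mean()          # unweighted mean of recalls
standard_acc = cm.diagonal().sum() / cm.sum()   # overall fraction correct

print(round(balanced_acc, 4))  # 0.7038, matching balanced_accuracy_score
print(round(standard_acc, 4))  # 0.641, matching the standard accuracy
```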

Confusion Matrix Analysis:¶

  • The model performs well for the Very Weak and Very High Strength classes, with high true-positive counts (27 of 32 and 8 of 8, respectively).
  • High Strength and Normal Strength have lower precision and recall, with notable misclassifications into neighboring strength categories.
  • Very High Strength is classified perfectly here (recall of 1.00), but the small sample size (8 instances) makes this estimate unreliable.

Classification Report Highlights:¶

  • Very High Strength Class: Highest recall (1.00) and F1-score (0.84), though the small support (8) limits confidence in this result.
  • Very Weak Class: Strong performance with precision of 0.68 and recall of 0.84.
  • High Strength and Normal Strength Classes: High Strength has the lowest recall (0.50) and Normal Strength the lowest precision (0.52), suggesting difficulty separating these neighboring categories.
  • Macro Average: An F1-score of 0.67, indicating uneven performance across classes driven by the imbalance.
  • Weighted Average: An F1-score of 0.64, consistent with the 64% standard accuracy, leaving room for improvement in handling class imbalance.

Convergence Warning:¶

The model did not converge within the tuned maximum number of iterations (max_iter=373). This suggests increasing max_iter or further tuning other hyperparameters such as C or the solver to achieve convergence.
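A minimal sketch of the two usual remedies, standardizing the features and raising the iteration budget; the synthetic dataset and the 5000-iteration budget here are illustrative assumptions, not the project's actual data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative synthetic multiclass data (not the concrete dataset)
X, y = make_classification(n_samples=500, n_features=8, n_informative=5,
                           n_classes=3, random_state=42)

# saga is sensitive to feature scale: standardizing usually lets it
# converge well within a generous iteration budget.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='saga', penalty='elasticnet',
                       l1_ratio=0.5, C=1.0, max_iter=5000),
)
pipe.fit(X, y)

# n_iter_ reports the iterations actually used; staying below max_iter
# means the solver converged before exhausting the budget.
n_iter = int(pipe.named_steps['logisticregression'].n_iter_.max())
print(n_iter < 5000)
```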

Conclusion:¶

The Logistic Regression model, with optimized hyperparameters, achieves moderate performance, classifying the Very Weak and Very High Strength classes best. However, improvements are needed for handling class imbalance, especially in the High Strength and Normal Strength categories. The non-convergence issue also indicates the need for further fine-tuning.

Plot Classification Metrics for Logistic Regression Model¶

In [136]:
# Assuming the logistic regression model results and the necessary variables are:
# - y_pred_lr: Predictions from the Logistic Regression model
# - metrics_lr: Dictionary containing the accuracy and other metrics
# - class_labels: List of the unique class labels

# Confusion matrix and classification report
conf_matrix_lr = confusion_matrix(y_test, y_pred_lr)
classification_rep_lr = classification_report(y_test, y_pred_lr, output_dict=True)

# Extracting the accuracy score from the logistic regression metrics
accuracy_lr = metrics_lr['Standard Accuracy']

# Calling the plot_classification_metrics function with Logistic Regression outputs
plot_classification_metrics(y_pred_lr, accuracy_lr, conf_matrix_lr, classification_rep_lr, class_labels=class_labels)

Insights from Classification Metrics¶

  • Confusion Matrix:

    • Strong performance for the 'Very Weak' and 'Very High Strength' categories, but confusion exists between 'High Strength' and 'Normal Strength', as well as 'Weak' and 'Normal Strength'.
    • 'High Strength' is often misclassified as 'Normal Strength', indicating weak separation between these neighboring classes.
  • Precision, Recall, F1-Score:

    • 'Very Weak': Solid precision (0.68) and high recall (0.84), indicating good prediction performance.
    • 'Weak': Moderate recall (~0.59), showing difficulty in identifying this class.
    • 'Very High Strength': Perfect recall (1.00) with precision of 0.73, though supported by only 8 instances.
    • 'High Strength': Reasonable precision (0.72) but low recall (0.50), indicating overlap with 'Normal Strength'.
    • 'Normal Strength': Lowest precision (0.52), indicating significant misclassification into this class.
  • Overall:

    • The model performs well for 'Very Weak' and 'Very High Strength' but struggles with 'High Strength' and 'Normal Strength', suggesting the need for further tuning.

Training and Evaluation of an SVM Classifier with MinMax Scaling¶

In [137]:
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, balanced_accuracy_score
from sklearn.utils.class_weight import compute_class_weight

# Split the data to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(X_cf, y_cf, test_size=0.2, random_state=42)

# Define the model pipeline
svm_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),  # Apply scaling
    ('svc', SVC())
])

# Define the search space for Bayesian Optimization
search_spaces = {
    'svc__C': Real(1e-6, 1e+6, prior='log-uniform'),
    'svc__gamma': Real(1e-6, 1e+1, prior='log-uniform'),
    'svc__kernel': Categorical(['linear', 'rbf', 'poly', 'sigmoid']),
    'svc__degree': Integer(1, 5)  # Only relevant if kernel is 'poly'
}

# Set up Bayesian optimization
bayes_search = BayesSearchCV(
    svm_pipeline,
    search_spaces,
    n_iter=30,  # Number of iterations for optimization
    cv=5,  # Cross-validation folds
    n_jobs=-1,
    scoring='accuracy',
    random_state=42
)

# Fit the model
bayes_search.fit(X_train, y_train)

# Extract the best model
best_svm = bayes_search.best_estimator_

# Make predictions on the test set
y_pred_svm = best_svm.predict(X_test)

# Calculate the accuracy
accuracy_svm = accuracy_score(y_test, y_pred_svm)

# Compute confusion matrix
conf_matrix_svm = confusion_matrix(y_test, y_pred_svm)

# Generate classification report
classification_rep_svm = classification_report(y_test, y_pred_svm, output_dict=True)

# Print the results
print(f"Best parameters: {bayes_search.best_params_}")
print(f"Test Standard Accuracy: {accuracy_svm:.4f}")
print("\nConfusion Matrix:\n", conf_matrix_svm)
print("\nClassification Report:\n", classification_rep_svm)
Best parameters: OrderedDict([('svc__C', 1000000.0), ('svc__degree', 5), ('svc__gamma', 0.06032980889318573), ('svc__kernel', 'rbf')])
Test Standard Accuracy: 0.7051

Confusion Matrix:
 [[16 16  3  0  1]
 [ 6 27  0  1  5]
 [ 1  0  7  0  0]
 [ 0  0  0 29  3]
 [ 0  8  0  2 31]]

Classification Report:
 {'High Strength': {'precision': 0.6956521739130435, 'recall': 0.4444444444444444, 'f1-score': 0.5423728813559322, 'support': 36.0}, 'Normal Strength': {'precision': 0.5294117647058824, 'recall': 0.6923076923076923, 'f1-score': 0.6, 'support': 39.0}, 'Very High Strength': {'precision': 0.7, 'recall': 0.875, 'f1-score': 0.7777777777777778, 'support': 8.0}, 'Very Weak': {'precision': 0.90625, 'recall': 0.90625, 'f1-score': 0.90625, 'support': 32.0}, 'Weak': {'precision': 0.775, 'recall': 0.7560975609756098, 'f1-score': 0.7654320987654321, 'support': 41.0}, 'accuracy': 0.7051282051282052, 'macro avg': {'precision': 0.721262787723785, 'recall': 0.7348199395455494, 'f1-score': 0.7183665515798283, 'support': 156.0}, 'weighted avg': {'precision': 0.718368827464096, 'recall': 0.7051282051282052, 'f1-score': 0.7021177051308877, 'support': 156.0}}
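The raw output_dict above is hard to scan; wrapping it in a pandas DataFrame recovers the familiar tabular report. A small sketch with toy labels (the labels here are illustrative, not the project's classes):

```python
import pandas as pd
from sklearn.metrics import classification_report

# Toy example: two classes, one error in each direction
y_true = ['Weak', 'Weak', 'High', 'High', 'High']
y_pred = ['Weak', 'High', 'High', 'High', 'Weak']

rep = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
acc = rep.pop('accuracy')          # scalar entry; remove so the rest stays tabular
df = pd.DataFrame(rep).T.round(3)  # rows: classes plus macro/weighted averages
print(df)
print(f"accuracy: {acc:.3f}")
```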
In [138]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

def calculate_weighted_metrics(y_true, y_pred, y_train):
    """
    Calculate Weighted Precision, Weighted Recall, Weighted F1-Score, and Weighted Accuracy.

    Parameters:
    - y_true: Actual target values
    - y_pred: Predicted target values
    - y_train: Training target values (for class weights)

    Returns:
    - metrics_weighted: Dictionary containing weighted metrics
    """
    # Calculate Weighted Precision, Recall, and F1-Score using scikit-learn
    weighted_precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    weighted_recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    weighted_f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    
    # Compute class weights based on training data
    class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
    class_weights_dict = dict(zip(np.unique(y_train), class_weights))
    
    # Compute confusion matrix
    conf_matrix = confusion_matrix(y_true, y_pred)
    
    # Calculate per-class accuracies
    class_accuracies = conf_matrix.diagonal() / conf_matrix.sum(axis=1)
    
    # Map class weights to the classes present in y_test
    classes_in_test = np.unique(y_true)
    weights_for_test = np.array([class_weights_dict[cls] for cls in classes_in_test])
    
    # Calculate Weighted Accuracy
    weighted_accuracy = np.average(class_accuracies, weights=weights_for_test)
    
    # Store all weighted metrics in a dictionary
    metrics_weighted = {
        'Weighted Accuracy': weighted_accuracy,
        'Weighted Precision': weighted_precision,
        'Weighted Recall': weighted_recall,
        'Weighted F1-Score': weighted_f1
    }
    
    return metrics_weighted
In [139]:
# Assuming you have already run your existing code up to predictions
# y_pred_svm = best_svm.predict(X_test)

# Calculate Weighted Metrics
metrics_weighted_svm = calculate_weighted_metrics(y_test, y_pred_svm, y_train)

# Display the weighted metrics
print(f"Weighted Accuracy: {metrics_weighted_svm['Weighted Accuracy']:.4f}")
print(f"Weighted Precision: {metrics_weighted_svm['Weighted Precision']:.4f}")
print(f"Weighted Recall: {metrics_weighted_svm['Weighted Recall']:.4f}")
print(f"Weighted F1-Score: {metrics_weighted_svm['Weighted F1-Score']:.4f}")
Weighted Accuracy: 0.7728
Weighted Precision: 0.7184
Weighted Recall: 0.7051
Weighted F1-Score: 0.7021

Insights from the SVM Classification Metrics¶

  • Best Parameters:

    • C: 1,000,000.0
    • Degree: 5 (only used by the polynomial kernel, so inactive here)
    • Gamma: 0.0603
    • Kernel: RBF (Radial Basis Function)
  • Confusion Matrix:

    • Very Weak: Predicted well (29 of 32) with only a few misclassifications into the 'Weak' category.
    • Weak: Solid recall (~76%), with some confusion with 'Normal Strength'.
    • Normal Strength: Recall of ~69%, but low precision because many 'High Strength' and 'Weak' instances are misclassified into it.
    • High Strength: The weakest class, with recall of only ~44%; most errors fall into 'Normal Strength'.
    • Very High Strength: Performs well with 87.5% recall but lower precision (70%).
  • Classification Report:

    • Very Weak: High precision (90.6%) and recall (90.6%), resulting in a strong F1-score of 0.906.
    • Weak: Precision and recall of ~77.5% and ~75.6%, respectively, with an F1-score of 0.765.
    • Normal Strength: Moderate performance with an F1-score of 0.600.
    • High Strength: Struggles, with recall at 44.4% and an F1-score of 0.542.
    • Very High Strength: Achieves a good F1-score of 0.778.
  • Overall Performance:

    • Macro Average F1-Score: 0.718, indicating that the model handles the different classes fairly well.
    • Weighted Average F1-Score: 0.702, suggesting reasonably balanced performance across categories.
    • The model shows room for improvement in predicting the 'High Strength' and 'Normal Strength' categories.

Plot Classification Metrics for Support Vector Machine (SVM) Classifier¶

In [140]:
# Assuming you have the required variables from the SVM model:
# y_pred_svm, accuracy_svm, conf_matrix_svm, classification_rep_svm, and class_labels

# Call the function with the SVM model outputs
plot_classification_metrics(
    y_pred_svm,  # Predictions from the SVM model
    accuracy_svm,  # Accuracy of the SVM model
    conf_matrix_svm,  # Confusion matrix from the SVM model
    classification_rep_svm,  # Classification report from the SVM model
    class_labels=class_labels  # List of class labels
)

Insights from SVM Classification Metrics¶

  • Confusion Matrix:

    • Very Weak: Performs well with 29 correct predictions and only 3 misclassified as 'Weak'.
    • Weak: Good recall with 31 correct predictions but some misclassifications into 'Normal Strength'.
    • Normal Strength: 27 correct predictions, but it absorbs many errors from neighboring classes, lowering its precision.
    • High Strength: The weakest class, with only 16 of 36 instances predicted correctly, mostly confused with 'Normal Strength'.
    • Very High Strength: Good performance with 7 of 8 correct predictions.
  • Classification Metrics:

    • Very Weak: High precision and recall (~0.91), indicating strong performance for this class.
    • Weak: Reasonable balance between precision (~0.78) and recall (~0.76).
    • Normal Strength: Decent recall (~0.69) but low precision (~0.53).
    • High Strength: Reasonable precision (~0.70) but low recall (~0.44).
    • Very High Strength: High recall (~0.88) with slightly lower precision (0.70), suggesting some difficulty distinguishing it from 'High Strength'.
  • Overall:

    • Macro Average F1-Score: 0.718, indicating generally balanced performance across classes.
    • Weighted Average F1-Score: 0.702, showing that the model is relatively well-balanced but can be improved, particularly in the 'High Strength' and 'Normal Strength' categories.
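Per-class recall and precision can be read straight off the confusion matrix: recall divides the diagonal by the row sums (true class), precision by the column sums (predicted class). A check against the SVM matrix printed above:

```python
import numpy as np

# SVM confusion matrix from the output above (sorted label order:
# High, Normal, Very High, Very Weak, Weak)
cm = np.array([[16, 16,  3,  0,  1],
               [ 6, 27,  0,  1,  5],
               [ 1,  0,  7,  0,  0],
               [ 0,  0,  0, 29,  3],
               [ 0,  8,  0,  2, 31]])

recall = cm.diagonal() / cm.sum(axis=1)     # row-wise: fraction of each true class recovered
precision = cm.diagonal() / cm.sum(axis=0)  # column-wise: fraction of each prediction that is right

print(np.round(recall, 3))     # matches the recall column of the report
print(np.round(precision, 3))  # matches the precision column of the report
```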

KNeighborsClassifier from scikit-learn for Classification Task¶

In [141]:
from sklearn.neighbors import KNeighborsClassifier

Evaluate and Visualize Results for k-Nearest Neighbors (KNN) Classifier¶

In [142]:
from skopt import BayesSearchCV
from skopt.space import Integer, Categorical
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Split the data to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(X_cf, y_cf, test_size=0.2, random_state=42)

# Define the KNN pipeline with scaling
knn_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),  # Apply scaling
    ('knn', KNeighborsClassifier())
])

# Define the search space for Bayesian Optimization
search_spaces = {
    'knn__n_neighbors': Integer(1, 30),  # Number of neighbors from 1 to 30
    'knn__weights': Categorical(['uniform', 'distance']),  # Weighting strategy for neighbors
    'knn__p': Integer(1, 2)  # p=1 for Manhattan distance, p=2 for Euclidean distance
}

# Set up Bayesian optimization
bayes_search = BayesSearchCV(
    knn_pipeline,
    search_spaces,
    n_iter=30,  # Number of iterations for optimization
    cv=5,  # Cross-validation folds
    n_jobs=-1,
    scoring='accuracy',
    random_state=42
)

# Fit the model
bayes_search.fit(X_train, y_train)

# Extract the best model
best_knn = bayes_search.best_estimator_

# Make predictions on the test set
y_pred_knn = best_knn.predict(X_test)

# Calculate the accuracy
accuracy_knn = accuracy_score(y_test, y_pred_knn)

# Compute confusion matrix
conf_matrix_knn = confusion_matrix(y_test, y_pred_knn)

# Generate classification report
classification_rep_knn = classification_report(y_test, y_pred_knn, output_dict=True)

# Print the results
print(f"Best parameters: {bayes_search.best_params_}")
print(f"Test Standard Accuracy: {accuracy_knn:.4f}")
print("\nConfusion Matrix:\n", conf_matrix_knn)
print("\nClassification Report:\n", classification_rep_knn)
/Users/soheil/anaconda3/envs/tf_m1/lib/python3.10/site-packages/joblib/externals/loky/process_executor.py:752: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  warnings.warn(
/Users/soheil/anaconda3/envs/tf_m1/lib/python3.10/site-packages/skopt/optimizer/optimizer.py:517: UserWarning: The objective has been evaluated at point [6, 1, 'distance'] before, using random point [21, 1, 'distance']
  warnings.warn(
/Users/soheil/anaconda3/envs/tf_m1/lib/python3.10/site-packages/skopt/optimizer/optimizer.py:517: UserWarning: The objective has been evaluated at point [6, 1, 'distance'] before, using random point [19, 2, 'distance']
  warnings.warn(
/Users/soheil/anaconda3/envs/tf_m1/lib/python3.10/site-packages/skopt/optimizer/optimizer.py:517: UserWarning: The objective has been evaluated at point [6, 1, 'distance'] before, using random point [14, 1, 'distance']
  warnings.warn(
Best parameters: OrderedDict([('knn__n_neighbors', 5), ('knn__p', 1), ('knn__weights', 'distance')])
Test Standard Accuracy: 0.6795

Confusion Matrix:
 [[18 14  3  1  0]
 [ 3 27  0  1  8]
 [ 2  0  5  0  1]
 [ 0  0  0 30  2]
 [ 0 11  0  4 26]]

Classification Report:
 {'High Strength': {'precision': 0.782608695652174, 'recall': 0.5, 'f1-score': 0.6101694915254238, 'support': 36.0}, 'Normal Strength': {'precision': 0.5192307692307693, 'recall': 0.6923076923076923, 'f1-score': 0.5934065934065934, 'support': 39.0}, 'Very High Strength': {'precision': 0.625, 'recall': 0.625, 'f1-score': 0.625, 'support': 8.0}, 'Very Weak': {'precision': 0.8333333333333334, 'recall': 0.9375, 'f1-score': 0.8823529411764706, 'support': 32.0}, 'Weak': {'precision': 0.7027027027027027, 'recall': 0.6341463414634146, 'f1-score': 0.6666666666666666, 'support': 41.0}, 'accuracy': 0.6794871794871795, 'macro avg': {'precision': 0.6925751001837959, 'recall': 0.6777908067542214, 'f1-score': 0.6755191385550309, 'support': 156.0}, 'weighted avg': {'precision': 0.6980858366727932, 'recall': 0.6794871794871795, 'f1-score': 0.6774204249279024, 'support': 156.0}}
In [143]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# Calculate Weighted Precision, Recall, and F1-Score
weighted_precision = precision_score(y_test, y_pred_knn, average='weighted', zero_division=0)
weighted_recall = recall_score(y_test, y_pred_knn, average='weighted', zero_division=0)
weighted_f1 = f1_score(y_test, y_pred_knn, average='weighted', zero_division=0)

# Compute class weights based on training data
classes = np.unique(y_train)
class_weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weights_dict = dict(zip(classes, class_weights))

# Compute confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred_knn)

# Calculate per-class accuracies
with np.errstate(divide='ignore', invalid='ignore'):
    class_accuracies = conf_matrix.diagonal() / conf_matrix.sum(axis=1)
    class_accuracies = np.nan_to_num(class_accuracies)  # Replace NaN with 0

# Map class weights to the classes present in y_test
classes_in_test = np.unique(y_test)
weights_for_test = np.array([class_weights_dict[cls] for cls in classes_in_test])

# Calculate Weighted Accuracy
weighted_accuracy = np.average(class_accuracies, weights=weights_for_test)

# Print the weighted metrics
print(f"Weighted Accuracy: {weighted_accuracy:.4f}")
print(f"Weighted Precision: {weighted_precision:.4f}")
print(f"Weighted Recall: {weighted_recall:.4f}")
print(f"Weighted F1-Score: {weighted_f1:.4f}")
Weighted Accuracy: 0.6498
Weighted Precision: 0.6981
Weighted Recall: 0.6795
Weighted F1-Score: 0.6774

Insights from KNN Classification Metrics¶

  • Best Parameters:

    • n_neighbors: 5
    • p: 1 (Manhattan distance)
    • weights: 'distance'
  • Confusion Matrix:

    • Very Weak: Strong performance with 30 correct predictions and only a couple of misclassifications into 'Weak'.
    • Weak: Moderate recall (~63%) with some confusion with 'Normal Strength'.
    • Normal Strength: 27 correct predictions, but it attracts misclassifications from 'High Strength' and 'Weak', lowering its precision.
    • High Strength: Only 18 of 36 predicted correctly, mostly confused with 'Normal Strength'.
    • Very High Strength: 5 of 8 correct predictions; the small support makes this class volatile.
  • Classification Metrics:

    • Very Weak: High precision (0.83) and recall (0.94), resulting in a strong F1-score of 0.882.
    • Weak: Balanced performance with precision of ~0.70 and recall of ~0.63.
    • Normal Strength: Low precision (~0.52) but reasonable recall (~0.69).
    • High Strength: High precision (~0.78) but low recall (0.50).
    • Very High Strength: Precision and recall both at 0.625, for an F1-score of 0.625.
  • Overall Performance:

    • Accuracy: 0.6795, indicating reasonable overall performance.
    • Macro Average F1-Score: 0.676, showing that the model could be improved for the harder classes (e.g., 'High Strength').
    • Weighted Average F1-Score: 0.677, suggesting that the model performs relatively well across most classes but struggles with the 'High Strength' and 'Normal Strength' categories.
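The tuned p=1 selects the Manhattan (city-block) variant of the Minkowski distance rather than the Euclidean p=2. For a single pair of points the difference is easy to see:

```python
import numpy as np

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

manhattan = np.abs(a - b).sum()            # p=1: |3| + |4| = 7.0
euclidean = np.sqrt(((a - b) ** 2).sum())  # p=2: sqrt(9 + 16) = 5.0

print(manhattan, euclidean)  # 7.0 5.0
```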
In [144]:
# Call the function to plot the classification metrics
plot_classification_metrics(y_pred_knn, accuracy_knn, conf_matrix_knn, classification_rep_knn, class_labels)

Implementation and Evaluation of a Bagging Ensemble with Decision Trees¶

In [145]:
from skopt import BayesSearchCV
from skopt.space import Integer, Real
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Split the data to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(X_cf, y_cf, test_size=0.2, random_state=42)

# Define the base classifier (Decision Tree) with hyperparameters
base_estimator = DecisionTreeClassifier()

# Define the Bagging pipeline with scaling
bagging_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),  # Apply scaling
    ('bagging', BaggingClassifier(estimator=base_estimator, random_state=42))  # Corrected 'estimator' argument
])

# Define the search space for Bayesian Optimization
search_spaces = {
    'bagging__n_estimators': Integer(10, 100),  # Number of base estimators in the ensemble
    'bagging__max_samples': Real(0.1, 1.0, prior='uniform'),  # Fraction of samples to use
    'bagging__max_features': Real(0.1, 1.0, prior='uniform'),  # Fraction of features to use
    'bagging__bootstrap': [True, False],  # Whether samples are drawn with replacement
    'bagging__bootstrap_features': [True, False],  # Whether features are drawn with replacement
    'bagging__estimator__max_depth': Integer(1, 20),  # Max depth of each Decision Tree
    'bagging__estimator__criterion': ['gini', 'entropy']  # Criterion for splitting nodes
}

# Set up Bayesian optimization
bayes_search = BayesSearchCV(
    bagging_pipeline,
    search_spaces,
    n_iter=30,  # Number of iterations for optimization
    cv=5,  # Cross-validation folds
    n_jobs=-1,
    scoring='accuracy',
    random_state=42
)

# Fit the model
bayes_search.fit(X_train, y_train)

# Extract the best model
best_bagging = bayes_search.best_estimator_

# Make predictions on the test set
y_pred_bagging = best_bagging.predict(X_test)

# Calculate the accuracy
accuracy_bagging = accuracy_score(y_test, y_pred_bagging)

# Compute confusion matrix
conf_matrix_bagging = confusion_matrix(y_test, y_pred_bagging)

# Generate classification report
classification_rep_bagging = classification_report(y_test, y_pred_bagging, output_dict=True)

# Print the results
print(f"Best parameters: {bayes_search.best_params_}")
print(f"Test Standard Accuracy: {accuracy_bagging:.4f}")
print("\nConfusion Matrix:\n", conf_matrix_bagging)
print("\nClassification Report:\n", classification_rep_bagging)
Best parameters: OrderedDict([('bagging__bootstrap', False), ('bagging__bootstrap_features', False), ('bagging__estimator__criterion', 'gini'), ('bagging__estimator__max_depth', 17), ('bagging__max_features', 1.0), ('bagging__max_samples', 0.7068067304321641), ('bagging__n_estimators', 100)])
Test Standard Accuracy: 0.7436

Confusion Matrix:
 [[24 10  2  0  0]
 [ 1 29  0  1  8]
 [ 2  0  6  0  0]
 [ 0  0  0 28  4]
 [ 0  6  0  6 29]]

Classification Report:
 {'High Strength': {'precision': 0.8888888888888888, 'recall': 0.6666666666666666, 'f1-score': 0.7619047619047619, 'support': 36.0}, 'Normal Strength': {'precision': 0.6444444444444445, 'recall': 0.7435897435897436, 'f1-score': 0.6904761904761905, 'support': 39.0}, 'Very High Strength': {'precision': 0.75, 'recall': 0.75, 'f1-score': 0.75, 'support': 8.0}, 'Very Weak': {'precision': 0.8, 'recall': 0.875, 'f1-score': 0.835820895522388, 'support': 32.0}, 'Weak': {'precision': 0.7073170731707317, 'recall': 0.7073170731707317, 'f1-score': 0.7073170731707317, 'support': 41.0}, 'accuracy': 0.7435897435897436, 'macro avg': {'precision': 0.7581300813008129, 'recall': 0.7485146966854284, 'f1-score': 0.7491037842148144, 'support': 156.0}, 'weighted avg': {'precision': 0.7547008547008548, 'recall': 0.7435897435897436, 'f1-score': 0.7442526379093543, 'support': 156.0}}
In [146]:
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight
import numpy as np

def calculate_weighted_metrics(y_true, y_pred, y_train):
    """
    Calculate Weighted Precision, Weighted Recall, Weighted F1-Score, and Weighted Accuracy.

    Parameters:
    - y_true: Actual target values
    - y_pred: Predicted target values
    - y_train: Training target values (for class weights)

    Returns:
    - metrics_weighted: Dictionary containing weighted metrics
    """
    # Calculate Weighted Precision, Recall, and F1-Score
    weighted_precision = precision_score(y_true, y_pred, average='weighted', zero_division=0)
    weighted_recall = recall_score(y_true, y_pred, average='weighted', zero_division=0)
    weighted_f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)
    
    # Compute class weights based on training data
    classes = np.unique(y_train)
    class_weights = compute_class_weight('balanced', classes=classes, y=y_train)
    class_weights_dict = dict(zip(classes, class_weights))
    
    # Compute confusion matrix
    conf_matrix = confusion_matrix(y_true, y_pred)
    
    # Calculate per-class accuracies
    with np.errstate(divide='ignore', invalid='ignore'):
        class_accuracies = conf_matrix.diagonal() / conf_matrix.sum(axis=1)
        class_accuracies = np.nan_to_num(class_accuracies)  # Replace NaN with 0
    
    # Map class weights to the classes present in y_test
    classes_in_test = np.unique(y_true)
    weights_for_test = np.array([class_weights_dict[cls] for cls in classes_in_test])
    
    # Calculate Weighted Accuracy
    weighted_accuracy = np.average(class_accuracies, weights=weights_for_test)
    
    # Store all weighted metrics in a dictionary
    metrics_weighted = {
        'Weighted Accuracy': weighted_accuracy,
        'Weighted Precision': weighted_precision,
        'Weighted Recall': weighted_recall,
        'Weighted F1-Score': weighted_f1
    }
    
    return metrics_weighted

# Calculate Weighted Metrics
metrics_weighted_bagging = calculate_weighted_metrics(y_test, y_pred_bagging, y_train)

# Display the weighted metrics
print(f"Weighted Accuracy: {metrics_weighted_bagging['Weighted Accuracy']:.4f}")
print(f"Weighted Precision: {metrics_weighted_bagging['Weighted Precision']:.4f}")
print(f"Weighted Recall: {metrics_weighted_bagging['Weighted Recall']:.4f}")
print(f"Weighted F1-Score: {metrics_weighted_bagging['Weighted F1-Score']:.4f}")
Weighted Accuracy: 0.7446
Weighted Precision: 0.7547
Weighted Recall: 0.7436
Weighted F1-Score: 0.7443

Insights from Bagging Classifier (with Decision Tree) Classification Metrics¶

  • Best Parameters:

    • bootstrap: False
    • bootstrap_features: False
    • estimator__criterion: 'gini'
    • estimator__max_depth: 17
    • max_features: 1.0
    • max_samples: 0.707
    • n_estimators: 100
  • Confusion Matrix:

    • Very Weak: Strong performance with 28 correct predictions and minimal misclassifications.
    • Weak: Reasonable recall (~71%) but shows confusion with 'Normal Strength' and 'Very Weak'.
    • Normal Strength: 29 correct predictions (recall ~74%), though it absorbs misclassifications from neighboring classes.
    • High Strength: Recall of ~67% with very high precision; most errors fall into 'Normal Strength'.
    • Very High Strength: Solid recall (75%) with 6 of 8 correct predictions.
  • Classification Metrics:

    • Very Weak: High precision (0.80) and recall (0.875), resulting in a strong F1-score of 0.836.
    • Weak: Balanced performance with precision and recall both at 0.707.
    • Normal Strength: Precision is lower (0.644), giving an F1-score of 0.690.
    • High Strength: Excellent precision (0.889) but lower recall (0.667).
    • Very High Strength: Balanced F1-score of 0.75.
  • Overall Performance:

    • Accuracy: 0.7436, the best of the classifiers evaluated in this section.
    • Macro Average F1-Score: 0.749, reflecting balanced performance across all categories.
    • Weighted Average F1-Score: 0.744, suggesting that the model handles the data well but could improve on misclassifications for the 'High Strength' and 'Normal Strength' categories.
In [147]:
# Call the function to plot the classification metrics
plot_classification_metrics(
    y_pred_bagging,
    accuracy_bagging,
    conf_matrix_bagging,
    classification_rep_bagging,
    class_labels=['Very Weak', 'Weak', 'Normal Strength', 'High Strength', 'Very High Strength']  # Adjusted class labels
)
[Figure: confusion matrix and per-class classification metrics for the Bagging classifier]

Comparative Performance Analysis of Machine Learning Classifiers on Compressive Strength Prediction¶

Model Ranking for Concrete Compressive Strength Prediction¶

Based on the evaluated performance metrics (accuracy, precision, recall, and F1-score), here is the tentative ranking of the models:

  1. Random Forest Classifier:

    • Strengths: Excellent balance between precision and recall across all classes, especially effective at identifying the 'Very Weak' and 'Very High Strength' classes.
    • Weaknesses: Slight misclassifications between 'Normal Strength', 'High Strength', and 'Weak' classes, which could be improved with further tuning.
  2. SVM Classifier:

    • Strengths: High precision, recall, and F1-scores, indicating robustness in predicting extreme classes.
    • Weaknesses: Struggled with the 'High Strength' class, pointing to limitations in its linear decision boundary.
  3. Logistic Regression Classifier:

    • Strengths: Comparable performance to the SVM, especially strong in predicting 'Very Weak' and 'Very High Strength'.
    • Weaknesses: Weaker performance in distinguishing nuanced differences between 'Normal Strength', 'High Strength', and 'Weak'.
  4. Bagging with Decision Trees:

    • Strengths: High precision and recall, suggesting effective overall performance.
    • Weaknesses: Misclassifications concentrated in the 'Weak' and 'Normal Strength' categories, suggesting the ensemble still struggles to separate the mid-range classes.
  5. K-Nearest Neighbors (KNN):

    • Strengths: Strong performance in 'Very Weak' and 'Very High Strength' categories.
    • Weaknesses: Significant misclassifications in 'Weak' and 'Normal Strength' categories, likely impacted by the choice of 'k' and the distance metric.

The specific ranking can vary based on the cost of misclassification for each category in practical applications. For instance, if accurately predicting 'Very High Strength' is critical, models that excel in this class would be more valuable. Additionally, each model's performance could be enhanced through hyperparameter tuning, feature engineering, and employing ensemble methods.
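The point about misclassification cost can be made concrete with a cost-sensitive score: weight each cell of a confusion matrix by an application-specific cost and compare models on expected cost rather than raw accuracy. The cost matrix and toy confusion matrix below are purely illustrative, not values from this analysis:

```python
import numpy as np

# Hypothetical 5x5 cost matrix: rows = true class, columns = predicted class,
# ordered Very Weak .. Very High Strength. Values are illustrative only.
costs = np.array([
    [0, 1, 2, 3, 4],
    [1, 0, 1, 2, 3],
    [2, 1, 0, 1, 2],
    [3, 2, 1, 0, 1],
    [5, 4, 3, 2, 0],  # heavier penalty for missing 'Very High Strength'
])

def expected_cost(conf_matrix, costs):
    """Average misclassification cost per sample implied by a confusion matrix."""
    conf_matrix = np.asarray(conf_matrix)
    return (conf_matrix * costs).sum() / conf_matrix.sum()

# Toy confusion matrix standing in for e.g. conf_matrix_bagging
toy_cm = np.diag([32, 28, 30, 27, 28])
toy_cm[1, 0] = 4  # a few 'Weak' samples predicted as 'Very Weak'
print(f"Expected cost per sample: {expected_cost(toy_cm, costs):.4f}")
```

Ranking models by expected cost instead of accuracy can reorder them whenever the errors they make fall in differently priced cells.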

Unsupervised Machine Learning¶

Optimizing and Analyzing Concrete Mix Data Clusters with K-Means¶

In [148]:
df2
Out[148]:
Blast Furnace Slag Fly Ash Superplasticizer Coarse Aggregate Fine Aggregate Age Water_Cement_Ratio Coarse_Fine_Ratio Strength Strength_Category
1 0.0 0.0 2.5 1055.0 676.0 28 0.300000 1.560651 61.89 Very High Strength
8 114.0 0.0 0.0 932.0 670.0 28 0.857143 1.391045 45.85 High Strength
11 132.4 0.0 0.0 978.4 825.5 28 0.966767 1.185221 28.02 Weak
14 76.0 0.0 0.0 932.0 670.0 28 0.750000 1.391045 47.81 High Strength
21 209.4 0.0 0.0 1047.0 806.9 28 1.375358 1.297559 28.24 Weak
... ... ... ... ... ... ... ... ... ... ...
1025 116.0 90.3 8.9 870.1 768.3 28 0.649783 1.132500 44.28 High Strength
1026 0.0 115.6 10.4 817.9 813.4 28 0.608318 1.005532 31.18 Normal Strength
1027 139.4 108.6 6.1 892.4 780.0 28 1.297643 1.144103 23.70 Weak
1028 186.7 0.0 11.3 989.6 788.9 28 1.103708 1.254405 32.77 Normal Strength
1029 100.5 78.3 8.6 864.5 761.5 28 0.768877 1.135259 32.40 Normal Strength

776 rows × 10 columns

In [149]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from yellowbrick.cluster import KElbowVisualizer
from sklearn.cluster import KMeans

# Specify the column names to be used in clustering
column_names = ['Blast Furnace Slag', 'Fly Ash', 'Superplasticizer', 'Age', 
                'Water_Cement_Ratio', 'Coarse_Fine_Ratio']

# Select columns from the DataFrame for clustering
X = df2[column_names]

# Scale the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Initialize the KMeans model with a specific number of initializations
kmeans_model = KMeans(n_init=10)

# Use an Elbow visualizer to find the optimal number of clusters
visualizer = KElbowVisualizer(kmeans_model, k=(1, 10))
visualizer.fit(X_scaled)  # Fit the scaled data
visualizer.show()  # Display the elbow plot

# Retrieve the optimal number of clusters detected by the elbow method
opt_clusters = visualizer.elbow_value_

# Retrain the KMeans model with the optimal number of clusters
kmeans = KMeans(n_clusters=opt_clusters, n_init=10, random_state=0).fit(X_scaled)

# Assign the cluster labels to a new column in the original dataframe
df2["KMeans_Cluster"] = kmeans.labels_

# Display the first few rows of the dataframe to verify the clustering
print(df2[["KMeans_Cluster"] + column_names].head())
[Figure: elbow plot of distortion score versus number of clusters]
    KMeans_Cluster  Blast Furnace Slag  Fly Ash  Superplasticizer  Age  \
1                0                 0.0      0.0               2.5   28   
8                1               114.0      0.0               0.0   28   
11               1               132.4      0.0               0.0   28   
14               0                76.0      0.0               0.0   28   
21               1               209.4      0.0               0.0   28   

    Water_Cement_Ratio  Coarse_Fine_Ratio  
1             0.300000           1.560651  
8             0.857143           1.391045  
11            0.966767           1.185221  
14            0.750000           1.391045  
21            1.375358           1.297559  

Clustering Insights from KMeans Model¶

Overview¶

The KMeans clustering algorithm was applied to the concrete mix data to identify patterns based on attributes such as Blast Furnace Slag, Fly Ash, Superplasticizer, Age, Water-Cement Ratio, and Coarse-Fine Ratio. The process involved scaling the data and using the elbow method to determine the optimal number of clusters.

Optimal Number of Clusters¶

The elbow plot indicates that the optimal number of clusters for this dataset is 3, consistent with the three cluster labels that appear in the cluster statistics later in the analysis. The elbow method identifies the point where the decrease in distortion score (the sum of squared distances from points to their cluster centers) levels off, suggesting that additional clusters beyond this point do not significantly enhance the model's ability to explain variance in the data.
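The distortion scores that KElbowVisualizer plots are simply the KMeans inertia values, so the elbow can also be inspected directly. A sketch on synthetic blob data (in the notebook, X_scaled would be fitted instead):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for X_scaled: three well-separated blobs
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias.append(km.inertia_)  # sum of squared distances to centroids

# The "elbow" is where successive drops in inertia flatten out
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
for k, inertia in zip(range(1, 8), inertias):
    print(f"k={k}: inertia={inertia:.1f}")
```

With three true blobs, the drop from k=2 to k=3 dwarfs the drop from k=3 to k=4, which is exactly the pattern the elbow criterion looks for.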

Cluster Characteristics¶

  • Cluster 0 appears to consist of samples with lower levels of Blast Furnace Slag and varying Water-Cement and Coarse-Fine ratios.
  • Cluster 1 is characterized by a substantial presence of Blast Furnace Slag (114.0-209.4 units in the rows shown) and an absence of Superplasticizer and Fly Ash.
  • Cluster 2's characteristics are not immediately clear from the initial rows displayed, suggesting further analysis is needed to understand its defining properties.

Interpretation¶

The clustering has revealed distinct groups within the data that could correspond to different types of concrete formulations, particularly in relation to the use of specific materials like Blast Furnace Slag and Superplasticizer. The Age of the concrete samples does not show significant variation across the clusters in the initial data shown, suggesting that the material composition, rather than curing time, is a more defining feature of the clusters.

Further Analysis¶

  • Cluster Distribution: Additional analysis can include a deeper examination of the distribution of other variables within each cluster to better understand their characteristics.
  • Cluster Validation: Assessing the stability of these clusters with other methods such as silhouette scores could validate the appropriateness of the three-cluster solution.
  • Practical Application: Understanding these clusters can help in categorizing different concrete types and potentially in predicting their performance characteristics.

These insights help in understanding the underlying patterns in concrete formulations and can guide further data-driven decision-making in construction material management.
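The silhouette-score validation suggested above can be sketched as follows, again on synthetic blob data as a stand-in for X_scaled; the k with the highest average silhouette corroborates (or challenges) the elbow choice:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_scaled
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.7, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)  # mean silhouette in [-1, 1]
    print(f"k={k}: silhouette={scores[k]:.3f}")

best_k = max(scores, key=scores.get)
print(f"Best k by silhouette: {best_k}")
```

Silhouette values near 1 indicate tight, well-separated clusters; values near 0 indicate overlapping clusters, so a low best score would itself be a warning sign about the clustering.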

Visualizing K-Means Clustering with Pair Plot¶

In [150]:
import seaborn as sns
import matplotlib.pyplot as plt

# Select only the columns needed for plotting along with the cluster labels
df_plot = df2[column_names + ["KMeans_Cluster"]]

# Create a pair plot (scatter matrix) with hue based on cluster labels
pair_plot = sns.pairplot(df_plot, hue="KMeans_Cluster", palette='rainbow')

# Show the plot
plt.show()
[Figure: pair plot of the clustering features colored by K-Means cluster]

Exploring Alignment of Clusters with Concrete Strength¶

In [151]:
import seaborn as sns
import matplotlib.pyplot as plt
import warnings

warnings.simplefilter(action='ignore', category=FutureWarning)  # Suppress FutureWarnings

# Create a box plot to visualize clusters vs. concrete strength
plt.figure(figsize=(10, 6))
sns.boxplot(x="KMeans_Cluster", y="Strength", data=df2, palette='rainbow')
plt.title("Clusters vs. Strength")
plt.xlabel("Clusters")
plt.ylabel("Strength")
plt.show()
[Figure: box plot of concrete strength by K-Means cluster]

Calculating Cluster Statistics for Concrete Strength¶

In [152]:
# Group data by cluster labels
cluster_stats = df2.groupby("KMeans_Cluster")["Strength"].agg(['mean', 'median', 'std', 'count'])

# Display the cluster statistics
print(cluster_stats)
                     mean  median        std  count
KMeans_Cluster                                     
0               27.883313  24.800  15.067850    166
1               33.805545  32.825  18.431228    220
2               32.856359  32.245  13.787231    390

Hypothesis Testing for Clustering Alignment with Concrete Strength¶

In [153]:
from scipy import stats


# Group data by cluster labels
grouped_data = [df2[df2["KMeans_Cluster"] == cluster]["Strength"] for cluster in df2["KMeans_Cluster"].unique()]

# Perform one-way ANOVA
f_statistic, p_value = stats.f_oneway(*grouped_data)

# Check if the p-value is less than the significance level (e.g., 0.05)
alpha = 0.05
if p_value < alpha:
    print("Reject the null hypothesis: There are significant differences between clusters.")
else:
    print("Fail to reject the null hypothesis: No significant differences between clusters.")
Reject the null hypothesis: There are significant differences between clusters.

DBSCAN Clustering Analysis of Concrete Dataset¶

In [154]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns  # Ensure seaborn is imported
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

# Select the columns for clustering
column_names = ['Blast Furnace Slag', 'Fly Ash', 'Superplasticizer', 'Age', 
                'Water_Cement_Ratio', 'Coarse_Fine_Ratio']
X = df2[column_names]

# Scale the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Create a DBSCAN model
dbscan = DBSCAN(eps=1, min_samples=5)  # eps and min_samples could be adjusted as needed

# Fit the model to the scaled data
dbscan.fit(X_scaled)

# Get the cluster labels (-1 represents noise/outliers)
cluster_labels = dbscan.labels_

# Adding the cluster labels to the DataFrame
df2['DBSCAN_Cluster'] = cluster_labels

# Create a pair plot (scatter matrix) with hue based on cluster labels
df_plot = df2[column_names + ["DBSCAN_Cluster"]]  # Ensure dataframe for plotting is defined correctly
sns.pairplot(df_plot, hue="DBSCAN_Cluster", palette='rainbow', plot_kws={'alpha': 0.5})  # Adjust transparency with alpha

# Show the plot
plt.show()
[Figure: pair plot of the clustering features colored by DBSCAN cluster]
In [155]:
import numpy as np

# Exclude noise by filtering out -1, then find the unique clusters
unique_clusters = np.unique(cluster_labels[cluster_labels != -1])

# Count the number of unique clusters
num_clusters = len(unique_clusters)

print(f"Number of clusters found by DBSCAN: {num_clusters}")
Number of clusters found by DBSCAN: 1

Counting Clusters and Data Points in DBSCAN Analysis¶

In [156]:
# Count the number of unique cluster labels (excluding noise points)
unique_clusters = set(cluster_labels)
num_clusters = len(unique_clusters) - (1 if -1 in unique_clusters else 0)

# Count the number of data points in each cluster
cluster_counts = {cluster: list(cluster_labels).count(cluster) for cluster in unique_clusters}

# Print the results
print(f"Number of clusters: {num_clusters}")
print("Number of data points in each cluster:")
for cluster, count in cluster_counts.items():
    if cluster == -1:
        print(f"  Noise/Outliers: {count} data points")
    else:
        print(f"  Cluster {cluster}: {count} data points")
Number of clusters: 1
Number of data points in each cluster:
  Cluster 0: 776 data points
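A single all-absorbing cluster usually means eps is too large for the scaled feature space. A common heuristic for choosing eps is the k-distance curve: sort each point's distance to its k-th nearest neighbor (k = min_samples) and look for a bend. A sketch on synthetic data, with X_scaled substituted in the notebook; the 90th-percentile shortcut here is an assumption standing in for reading the bend off a plot:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for X_scaled
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

k = 5  # match DBSCAN's min_samples
# n_neighbors=k+1 because each query point is returned as its own neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_demo)
distances, _ = nn.kneighbors(X_demo)
k_dist = np.sort(distances[:, -1])  # distance to the k-th true neighbor

candidate_eps = float(np.percentile(k_dist, 90))  # rough proxy for the bend
print(f"Candidate eps from the {k}-distance curve: {candidate_eps:.3f}")
```

Rerunning DBSCAN with an eps near this value (instead of eps=1 on MinMax-scaled features) would give the algorithm a chance to resolve more than one cluster.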

Hierarchical Clustering Analysis for Concrete Strength Dataset¶

In [157]:
import scipy.cluster.hierarchy as sch
import matplotlib.pyplot as plt

# Assuming X_scaled is already defined and contains the scaled feature data
# Create a dendrogram using the 'ward' linkage method
dendrogram = sch.dendrogram(sch.linkage(X_scaled, method='ward'))

plt.title('Dendrogram for Determining Optimal Clusters')
plt.xlabel('Samples')
plt.ylabel('Distance')
plt.show()
[Figure: ward-linkage dendrogram of the scaled features]
In [158]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import AgglomerativeClustering
import scipy.cluster.hierarchy as sch

# Select the columns for clustering
column_names = ['Blast Furnace Slag', 'Fly Ash', 'Superplasticizer', 'Age', 
                'Water_Cement_Ratio', 'Coarse_Fine_Ratio']
X = df2[column_names]

# Scale the data
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Performing Hierarchical Clustering
n_clusters = 5  # Adjust as needed
hierarchical_clustering = AgglomerativeClustering(n_clusters=n_clusters, linkage='ward')
cluster_labels = hierarchical_clustering.fit_predict(X_scaled)

# Step 3: Visualizing the Clusters
# Create a dendrogram to visualize the hierarchical structure
plt.figure(figsize=(10, 7))  # Specify the figure size
dendrogram = sch.dendrogram(sch.linkage(X_scaled, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Euclidean Distances')
plt.show()

# Step 4: Interpreting the Results
# Add cluster labels to the DataFrame
df2['Hierarchical_Cluster'] = cluster_labels

# Select only the numeric columns for aggregation
numeric_columns = df2.select_dtypes(include=[np.number]).columns

# Group by clusters and compute the mean of numeric columns only
cluster_means = df2.groupby('Hierarchical_Cluster')[numeric_columns].mean()
print(cluster_means)

# Visualize the clusters using pair plots
sns.pairplot(df2, vars=column_names, hue='Hierarchical_Cluster', palette='rainbow')
plt.show()
[Figure: ward-linkage dendrogram of the scaled features]
                      Blast Furnace Slag     Fly Ash  Superplasticizer  \
Hierarchical_Cluster                                                     
0                              46.532523  123.948024          9.151368   
1                             165.588696    2.556522         10.428696   
2                               0.303030    0.721212          0.623030   
3                              18.266129  120.995161          8.430645   
4                             190.066667    0.000000          0.220000   

                      Coarse Aggregate  Fine Aggregate        Age  \
Hierarchical_Cluster                                                
0                           956.037386      779.223708  20.471125   
1                           945.011304      775.306957  24.608696   
2                          1023.072121      774.484242  17.309091   
3                           996.091935      801.298387  56.000000   
4                           973.188571      765.492381  15.800000   

                      Water_Cement_Ratio  Coarse_Fine_Ratio   Strength  \
Hierarchical_Cluster                                                     
0                               0.828966           1.234938  30.251581   
1                               0.640195           1.234412  45.795913   
2                               0.553363           1.338827  27.762545   
3                               0.715041           1.248971  46.583548   
4                               1.109634           1.286877  20.871714   

                      KMeans_Cluster  DBSCAN_Cluster  Hierarchical_Cluster  
Hierarchical_Cluster                                                        
0                           1.996960             0.0                   0.0  
1                           1.000000             0.0                   1.0  
2                           0.000000             0.0                   2.0  
3                           2.000000             0.0                   3.0  
4                           0.990476             0.0                   4.0  
[Figure: pair plot of the clustering features colored by hierarchical cluster]

Exploratory Analysis with Hierarchical Clustering for Concrete Strength Dataset¶

In [159]:
import numpy as np

# Count the number of unique cluster labels
unique_clusters = np.unique(cluster_labels)
num_clusters = len(unique_clusters)

# Count the number of data points in each cluster
cluster_counts = {cluster: np.sum(cluster_labels == cluster) for cluster in unique_clusters}

# Print the results
print(f"Number of clusters: {num_clusters}")
print("Number of data points in each cluster:")
for cluster, count in cluster_counts.items():
    print(f"  Cluster {cluster}: {count} data points")
Number of clusters: 5
Number of data points in each cluster:
  Cluster 0: 329 data points
  Cluster 1: 115 data points
  Cluster 2: 165 data points
  Cluster 3: 62 data points
  Cluster 4: 105 data points

Assessing Variations in Concrete Strength Across Clusters Using ANOVA¶

In [160]:
from scipy.stats import f_oneway

# Select the concrete compressive strength and cluster labels
concrete_strength = df2['Strength']
cluster_labels = df2['Hierarchical_Cluster']

# Dynamically gather concrete strength data for each cluster
unique_clusters = np.unique(cluster_labels)
cluster_groups = [concrete_strength[cluster_labels == cluster] for cluster in unique_clusters]

# Perform one-way ANOVA to test if there are significant differences in concrete strength
# across the clusters
statistic, p_value = f_oneway(*cluster_groups)

# Set the significance level (alpha)
alpha = 0.05

# Check if the p-value is less than alpha to determine significance
if p_value < alpha:
    print("Hypothesis Testing Result: There are significant differences in concrete strength across the clusters.")
else:
    print("Hypothesis Testing Result: There are no significant differences in concrete strength across the clusters.")
Hypothesis Testing Result: There are significant differences in concrete strength across the clusters.
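ANOVA only establishes that at least one cluster mean differs; a post-hoc test such as Tukey's HSD identifies which pairs of clusters differ. This is a sketch with synthetic per-cluster strength samples standing in for the entries of cluster_groups, using scipy.stats.tukey_hsd (available in recent SciPy releases):

```python
import numpy as np
from scipy.stats import tukey_hsd

rng = np.random.default_rng(0)
# Synthetic stand-ins for three clusters' strength values (MPa-like)
g0 = rng.normal(30, 5, 80)
g1 = rng.normal(46, 5, 80)
g2 = rng.normal(21, 5, 80)

# Pairwise mean comparisons with family-wise error control
res = tukey_hsd(g0, g1, g2)
print(res)
```

Pairs whose p-value falls below the significance level differ significantly; the three synthetic groups here are constructed so that every pair does.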

Final Overview of the Dataset with Clustering Labels¶

In [161]:
df2.head()
Out[161]:
Blast Furnace Slag Fly Ash Superplasticizer Coarse Aggregate Fine Aggregate Age Water_Cement_Ratio Coarse_Fine_Ratio Strength Strength_Category KMeans_Cluster DBSCAN_Cluster Hierarchical_Cluster
1 0.0 0.0 2.5 1055.0 676.0 28 0.300000 1.560651 61.89 Very High Strength 0 0 2
8 114.0 0.0 0.0 932.0 670.0 28 0.857143 1.391045 45.85 High Strength 1 0 4
11 132.4 0.0 0.0 978.4 825.5 28 0.966767 1.185221 28.02 Weak 1 0 4
14 76.0 0.0 0.0 932.0 670.0 28 0.750000 1.391045 47.81 High Strength 0 0 4
21 209.4 0.0 0.0 1047.0 806.9 28 1.375358 1.297559 28.24 Weak 1 0 4

External Dataset¶

Summarization and Interpretation of Findings¶

Key findings and interpretations:

Exploratory Data Analysis

  • Cement content has a strong positive correlation with concrete compressive strength. As cement increases, strength tends to increase.
  • Blast furnace slag and fly ash are used as cement replacements and have negative correlations with cement content.
  • Water content is reduced by using superplasticizers while maintaining workability.
  • Superplasticizers have a strong positive correlation with concrete strength.
  • Coarse and fine aggregates show little correlation with strength in this dataset.
  • Concrete strength generally increases over time as curing continues.

Model Evaluation

  • Random Forest Regression provides the most accurate predictions of concrete strength with the lowest error (MSE=19.87).
  • AdaBoost, Gradient Boosting, and KNN regressors also perform well. Linear models tend to have higher error.
  • For classification, Random Forest Classifier has the best balance of precision and recall across strength classes (accuracy 75.41%).

Clustering Analysis

  • K-means clustering revealed 3 distinct groups based on the elbow criterion. Cluster statistics show variation in average strength.
  • ANOVA results indicate significant differences in strength across K-means clusters.
  • Hierarchical clustering forms intuitive groups of data points with more gradation. Statistics also vary across clusters.
  • Both clustering approaches provide ways to explore and analyze patterns in the underlying dataset.

Suggestions and Recommendations¶

Based on the analysis conducted, here are some suggestions and recommendations to help improve concrete compressive strength prediction and optimization:

  1. Leverage advanced machine learning models like Random Forest Regression as the primary prediction algorithm due to their higher accuracy over linear regression techniques. Continued hyperparameter tuning could further improve performance.
  2. Incorporate broader information about curing conditions and construction methods into the data collection and modeling, as the current analysis is limited to ingredients only. This can capture more factors that influence strength.
  3. Apply clustering analysis during the mix design process to guide the selection of ingredient proportions tailored to achieving target strength profiles. The composition could be modeled around cluster centroids.
  4. Establish feedback loops to incorporate measured concrete strength from past mixes into the training data for predictive models. This allows the system to continually learn and improve over time.
  5. Develop user-friendly interfaces and workflows to enable non-experts to effectively utilize the machine learning capabilities for strength prediction and mix optimization. Simplify model deployment.
  6. Explore edge cases and outlier points found during clustering to deepen understanding of scenarios that significantly influence concrete strength, either positively or negatively. Identify boundary operating conditions.
  7. Compute estimated economic impact and sustainability metrics for proposed changes to concrete mix designs guided by machine learning recommendations to quantify potential benefits.

Project Impact and Stakeholder Benefits¶

This project, which uses machine learning to optimize concrete mix design, has implications and benefits for various parties involved in the construction industry:

Contractors and Builders:

  • Optimized concrete mixes reduce material costs by improving efficiency.
  • Ensure achievement of target strength and quality requirements.
  • Speed up construction schedules and minimize delays.

Engineers:

  • Customize properties to expand possibilities in structural design.
  • Incorporate predictive analytics in the design process.
  • Enhance estimations of resilience for structures.

Scientists and Researchers:

  • Opportunity to apply AI solutions with real-world impact.
  • Drive the discovery of new materials and ingredients.
  • Validate models and research findings.

Policymakers:

  • Promote standardization and consistency in quality across the industry.
  • Develop performance-based regulations leveraging data-driven insights.
  • Evaluate impact through lifecycle analysis.

Equipment and Material Suppliers:

  • Enable just-in-time inventory management practices.
  • Improve manufacturing capabilities through data-driven usage analysis.

Overall, these predictive capabilities and process improvements can lead to cost savings, innovative building designs, faster construction, reduced waste, and sustainability benefits for stakeholders. Collaboration across disciplines from the early stages can guide tool development toward impact throughout project lifecycles, and such collaboration is crucial to realizing this potential while managing any associated risks.

Some real-world applications may focus on designing specialized concretes for saltwater environments, high-traffic structures, or 3D-printed buildings. Once the foundational predictive framework is established, the potential applications become vast and wide-ranging.

References¶

  • Ahmad, W., Ahmad, A., Ostrowski, K., Aslam, F., Joyklad, P., & Zajdel, P. (2021). Application of advanced machine learning approaches to predict the compressive strength of concrete containing supplementary cementitious materials. Materials, 14(19), 5762. https://doi.org/10.3390/ma14195762

  • Al-Hashem, M., Amin, M., Ahmad, W., Khan, K., Ahmad, A., Ehsan, S., … & Qadir, M. (2022). Data-driven techniques for evaluating the mechanical strength and raw material effects of steel fiber-reinforced concrete. Materials, 15(19), 6928. https://doi.org/10.3390/ma15196928

  • Khademi, F., Jamal, S. M., Deshpande, N., & Londhe, S. (2016). Predicting strength of recycled aggregate concrete using artificial neural network, adaptive neuro-fuzzy inference system and multiple linear regression. International Journal of Sustainable Built Environment, 5(2), 355-369. https://doi.org/10.1016/j.ijsbe.2016.09.003

  • Nawy, E. (2008). Concrete construction engineering handbook. https://doi.org/10.1201/9781420007657

  • Siddique, R., Aggarwal, P., & Aggarwal, Y. (2011). Prediction of compressive strength of self-compacting concrete containing bottom ash using artificial neural networks. Advances in Engineering Software, 42(10), 780-786. https://doi.org/10.1016/j.advengsoft.2011.05.016